22 Comments

"We’re most definitely optimizing, as we really care about these competitions on dataset benchmarks. But our algorithm is some sort of massively parallel genetic algorithm, not a clean, rigorous, and beautiful convex optimization method."

See? This illustrates my point about Darwin being the synthesis of Hume (messy, collective lurching about in the space of algorithms, objectives, etc.) and Kant (individual engineers finding a model on the internet and tweaking a parameter or two).

author

I'm coming around. Partially because it means I need to blog about Lewontin.

Sep 6, 2023 · Liked by Ben Recht

What's maybe ironic about that list of methods from Duda and Hart is that, although it's a long list, it doesn't appear to include a standard statistics-style logistic regression fitting procedure (i.e., maximum likelihood, a.k.a. cross entropy, which I believe is typically done in batch mode). So you could easily add that to the list.

author

Right. And algorithmically, it would look exactly like Table 1.

a_{k+1} = a_k + rho_k * y_k

rho_k = c * (1 - tanh(a_k^T y_k))
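
In code, a minimal sketch of that update (my own illustration, assuming the Duda-and-Hart convention that each sample y_k is a feature vector already multiplied by its ±1 label, and a hypothetical constant step scale c):

```python
import numpy as np

def fit_logistic_sgd(Y, c=0.1, epochs=10):
    """One-sample-at-a-time update a_{k+1} = a_k + rho_k * y_k,
    with rho_k = c * (1 - tanh(a_k^T y_k)).

    Y: array of shape (n, d) whose rows are label-multiplied
    feature vectors, following the table's convention."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for y in Y:
            rho = c * (1.0 - np.tanh(a @ y))  # step shrinks as the margin a^T y grows
            a = a + rho * y
    return a
```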

Sep 6, 2023 · Liked by Ben Recht

“Graduate student descent” - that’s hilarious! Ben, I can see your point that the choice of optimization algorithm doesn't matter, but what about ADMM? I seem to remember you had a nice paper using ADMM to decode linear codes.

author

Oh, there's a very important modifier that maybe I need to over-emphasize:

optimization does not matter *in machine learning*. Optimization is great in so many other contexts. But you need a clear specification for the costs and constraints (we have neither in ML).

Sep 6, 2023 · Liked by Ben Recht

Coming from Number Theory and optimization, I was initially perplexed by a lot of what I saw in machine learning (at least in how models get designed). I came to see that much of it was just trying a bunch of different things and seeing what worked best. I would ask, in vain, for criteria for setting the number of layers, the width, etc. Maybe they’re out there and I missed them.

author

Haha. I know. And people always say to me "engineering is messy." But everyone who comes from other engineering disciplines and looks at machine learning is horrified by what we do.


Yes, in ML we care about test accuracy (or learnt features) rather than training loss, but that doesn't mean it's pointless to study the theory of optimization. It just means that deep learning calls for a different kind of optimization theory -- one focused on rigorously understanding the _local dynamics of optimization algorithms_ at arbitrary points in weight space, rather than on global convergence rates (probably impossible for deep learning) or asymptotic analyses near a minimizer (useless, since deep learning doesn't happen near minima).

Yes, there are many cases where optimizer A trains faster than optimizer B yet generalizes worse (e.g. your 2017 paper about Adam), and in each of these cases we should seek to precisely understand why this occurs. There is a lot of compelling evidence (which I would be happy to link to) that many of these situations can be explained by the different ways in which different first-order optimization algorithms implicitly control the curvature along their optimization trajectories.

So we should reverse-engineer today's inscrutable SOTA pipelines in order to figure out: (1) what different first-order algorithms do; (2) what is the 'right' thing to do, so that today's hacks can become tomorrow's principled algorithms.

author

This project of reverse engineering SOTA feels too much like adaptationist evolutionary theory. Sometimes when we're obsessed with competitive testing, stuff gets propagated for no good reason.

What if there is no rigorous reason that one method works better than the other on one benchmark? Or even in an evaluation of benchmarks? What if it’s just a tangled mess of cultural memes, technical debt, and measurement artifacts? How could we ever rule that out?


I'm largely sympathetic to the gist of your 'what if there is no rigorous reason' question. Also, given that there are no-free-lunch theorems both for search/in-sample optimization and for generalization, there's a sense in which it's known that there is no rigorous reason that can hold across all problems. The gist of what you're getting at is partly why I soured on going into academia after getting an ML PhD in the 90s and doing OK in grad school. I was burnt out, maybe took those NFL theorems too much to heart, and felt like the whole field was just a bunch of educated guessing and fancy curve-fitting. For instance, it bugged me back then (and still does now) that stochastic gradient descent often does better than batch. I'm aware of some of the offered reasons (e.g. avoiding local minima, and I think there are some fancy papers which try to explain this with some rigor). I still just... aesthetically don't like that SGD does better than batch, given that your goal is out-of-sample fit, which should be best approximated by the overall, batch fit across the whole training set.

With that said, the space of problems which humans care about is not random, and there are clearly patterns to be observed about what works empirically when carefully tested out-of-sample. Although there are various challenges in industry applications that are not captured in Kaggle, I do think Kaggle provides some value as a record of what has worked best for each problem (where many contestants worked on it, with a bit of real skin in the game). Patterns across problems do emerge, e.g. gradient boosting machines usually doing well for tabular problems, CNNs for vision, etc.

If we suddenly started caring about some totally new, weird space of supervised learning problems that were unrelated either to computer vision, or speech, or LLMs, or tabular problems (which are often either largely monotone or maybe monotone plus a few if-then-ish rules)... like, I don't know, maybe something with very high-frequency patterns to be detected... then maybe no existing model or training algorithm would do well and we'd have to start almost from scratch. However, generally, there's some stability to the space of supervised learning problems we care about. I agree with what (I think) you're getting at, though. There's a sense in which the whole ML field is fundamentally empirical (let's see what works) and/or a matter of slowly arriving at the 'right' priors which happen to match the world we live in and the problems we care about.

Sep 6, 2023 · edited Sep 6, 2023

But DOES SGD actually do better than batch? Here's the thing: for fitting high-dimensional linear models (a random-features approximation to a GP) with large condition numbers, preconditioned conjugate gradients (with a Nyström preconditioner) eats Adam, AMSGrad, SVRG, SAG, and all the modern versions of SGD for lunch. Empirically, it really does -- I lost some time recently trying to find a variant of SGD that could compete with preconditioned CG for this very purpose. This is unsurprising, because Adam and AMSGrad are in some sense using a diagonal approximation to the Hessian, which is a very bad approximation in most situations where the Hessian is ill-conditioned. And you can easily implement CG in such a way that you only load one batch of data into memory at a time, so even though you're computing the full gradient, the memory cost is pretty modest. Even if you precondition SVRG with the same preconditioner you're using for CG, it's still pretty bad compared to preconditioned CG, partly because SVRG requires extensive learning rate tuning while CG does not.
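
For concreteness, here is a minimal sketch (my own, under simplifying assumptions: ridge regression as the stand-in linear model, and a plain Jacobi/diagonal preconditioner rather than the Nyström one mentioned above) of running preconditioned CG using only matrix-vector products:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n, d, lam = 2000, 500, 1e-3
X = rng.standard_normal((n, d)) * np.logspace(0, -3, d)  # ill-conditioned features
y = rng.standard_normal(n)

def hess_mv(w):
    # Hessian-vector product (X^T X + lam*I) w; never forms X^T X explicitly
    return X.T @ (X @ w) + lam * w

diag = (X ** 2).sum(axis=0) + lam                        # diag(X^T X) + lam
A = LinearOperator((d, d), matvec=hess_mv)
M = LinearOperator((d, d), matvec=lambda w: w / diag)    # Jacobi preconditioner

w_hat, info = cg(A, X.T @ y, M=M, maxiter=500)
```

Because CG only touches the data through products with X and X^T, those products can be computed one batch at a time, which is the memory point made above.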

Likewise, if you wanted to do some simple 1d curve fitting (let's say fit a five-parameter logistic to some ELISA data a biologist has collected), no one would use SGD, because it won't work as well as Levenberg-Marquardt or L-BFGS.
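
As a concrete illustration (my own sketch, with made-up synthetic data standing in for the ELISA measurements), a five-parameter logistic fit with SciPy's Levenberg-Marquardt backend via curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def five_pl(x, a, d, log_c, b, g):
    # a: response at low dose, d: response at high dose,
    # log_c: log of the inflection point, b: slope, g: asymmetry
    return d + (a - d) / (1.0 + np.exp(b * (np.log(x) - log_c))) ** g

x = np.logspace(-2, 3, 40)                           # synthetic dilution series
true = (0.05, 2.0, np.log(10.0), 1.2, 0.8)
rng = np.random.default_rng(1)
y = five_pl(x, *true) + 0.02 * rng.standard_normal(x.size)

p0 = (0.0, 2.5, np.log(5.0), 1.0, 1.0)
popt, pcov = curve_fit(five_pl, x, y, p0=p0, method="lm", maxfev=10000)
```

With only five parameters and cheap residuals, a Gauss-Newton-style method like this typically converges in a handful of iterations; there is no reason to reach for SGD here.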

So this, for me at least, seems to raise some interesting questions about Adam and SGD. We take something that works poorly for "simple" problems (1d curve fitting, high-dimensional linear regression) and use it for billion-parameter models and suddenly... it works great! What peculiarity of the deep learning loss landscape makes Adam work so well for this particular type of model? Have we selected the deep learning architectures that we selected because they worked well with SGD? If we had some other (possibly better) optimization approach, would we have selected other architectures instead?

Of course, you could argue that an SGD-trained model will generalize better, that SGD is in effect providing some regularization. Depending on context that might be true; but how would you evaluate this given that many alternative optimization algorithms can't be scaled to fit a deep learning model, at least not easily?


For sure, I can believe SGD loses badly for certain types of high-dim linear models or, e.g., 1d curve fitting. While I wouldn't say I'm a training-algorithm/optimization expert, I think SGD-type methods generally tend to win when the model is very nonlinear in the fitted parameters and tends to have local minima, e.g. deep neural nets or matrix factorization. For sure, you had to go with SGD for matrix factorization for the Netflix Prize, which I was heavily involved in as a competitor.

More generally, the observation 'SGD loses in this scenario but wins in that scenario' fits into the general theme we've discussed, i.e., that ML is heavy on 'seeing what works' and can be light on reliable, universal first principles.


Good point -- as you say, like many other things in ML, this seems to be a situation where practitioners have come to a consensus about "what works" in which scenario based on trial and error. There's nothing necessarily wrong with that -- it's enabled considerable progress on hard problems like protein structure prediction -- but I think it also arguably slows down progress, because it means that thousands of GPU-hours are being spent trying out new variations on existing neural network architectures, where only some of these variations will improve on existing methods (I like how the post describes this process as "graduate student descent" or "github descent"). Even if you've already selected an architecture, the time required to experiment with different hyperparameters (number of layers, size, learning rate schedulers, dropout, various modifications to the architecture, etc.) can be considerable.

To be fair, it is often possible to get reasonably good results quite quickly using "sensible" defaults, where "sensible" means "this works well in my experience" or "this is how they did it in the one paper and it worked really well for them". But that raises the question of whether there are alternatives that would work much better than the "sensible" defaults, and if so, how to find them without using some variant of "github descent". I like the way the post describes some of the other limitations and drawbacks of the "github descent" approach to research in ML.

I don't have a solution to these problems, unfortunately; I'm largely just agreeing with what I think some of the other commenters here are saying. I think we would all love to have a more principled way to pick, e.g., neural network architectures -- it would certainly save a lot of compute and time. Not to mention another question I find fascinating -- is there some more efficient way to solve some of the problems we are now solving with billion-parameter neural networks? Ten years from now, with the benefit of hindsight, is the current LLM craze going to look like a great leap forward, or a weird interlude in which tech companies spent fortunes on GPUs while overlooking simpler / more efficient solutions? If simpler solutions are possible, is there a way to find them without the time-consuming process of "github descent"?


I didn't say it was going to be easy! But there isn't really any other way forward.

I always have the following metaphor in my head. Suppose that modern civilization makes contact with some pre-modern society deep in the rainforest. The medicine men in this pre-modern society all say that the root of the Tanyapana tree can reverse cancer. One possible reaction is for the respected scientists to arrogantly say "oh, modern science has no theoretical justification for reversing cancer, so it can't be true." A different reaction is to take seriously the claims that the thing works, and to carefully isolate the active ingredient. Once isolated, the active ingredient can be turned into a proper medicine.

We need to take the SOTA pipelines, put them under a microscope, and find the active ingredient. Yes, they probably contain a lot of cultural memes, technical debt, etc alongside the active ingredient, but there has to be an active ingredient.


Is it not begging the question to assume that the root has an active ingredient that can be isolated and turned into a proper medicine to begin with? In the language of your metaphor, there could be some gene that the prehistoric people still have but which we have discarded through evolution - studying the root will never uncover this.


I'm blanking on this one -- which paper are you talking about?

> Much to Duda and Hart’s chagrin, I’m sure, the most cited paper in all of optimization would end up being a 2014 paper that proves an incorrect theorem with an unparsable convergence bound for linear discriminant functions.

author

Damek, are you trolling me or are you serious? Because I wish I didn't know this was the case:

https://scholar.google.com/scholar_lookup?title=Adam:+A+method+for+stochastic+optimization&author=Kingma,+D.P.&author=Ba,+J.&publication_year=2014&journal=arXiv


Oh, duh, wasn't trolling, just didn't mentally mark Adam as an "optimization paper" but as an "optimization in ML" paper.


A historical note: Pegasos is SGD on the SVM objective function, but clearly that is not the reason why the paper was influential.

Pegasos showed, for the first time, a finite-time O(1/T) convergence rate for SGD on SVMs. This was important for three reasons: 1) people at that time were optimizing SVMs in the dual with variants of SMO; 2) finite-time guarantees for SGD on strongly convex functions had been proven only one year before, at COLT 2006; 3) finally, and most importantly, Pegasos was actually fast, in practice, on real problems, and with a choice of the learning rate given directly by the theory. In fact, I was still getting SOTA results with multi-kernel learning on computer vision tasks with variants of Pegasos in 2010-2011.
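
For readers who haven't seen it, here is a rough sketch of the basic Pegasos step (my own rendering of the mini-batch-size-one version from the paper, not the commenter's code):

```python
import numpy as np

def pegasos(X, y, lam=0.01, T=10000, seed=0):
    # SGD on lambda/2 ||w||^2 + mean(max(0, 1 - y_i <w, x_i>))
    # with the theory-prescribed step size eta_t = 1/(lambda * t).
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(y))               # sample one example
        eta = 1.0 / (lam * t)                  # no schedule to tune
        if y[i] * (w @ X[i]) < 1:              # margin violated
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
        # (the paper also has an optional projection onto the ball
        #  of radius 1/sqrt(lam), omitted in this sketch)
    return w
```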

So, Pegasos was the first algorithm able to compete with SMO, both in speed and in the freedom from choosing learning rate schedules. And this was given completely by the theory.

This also explains why none of the SGD papers from 2007 and later cited the old papers: not because people were "reinventing SGD", but because the old papers proved asymptotic bounds under very weak conditions on the learning rate (exactly the third column in your table). In fact, Leon Bottou in those years was preaching about SGD, citing exactly the old papers, but without any clear choice of the learning rate his algorithms did not really convince the applied people for a long time.

author

I feel like you’re proving my point here.

If we actually cared about solving the SVM, then a 1/T algorithm would be a negative result. We'd want a faster algorithm.

But at NIPS 2007, Leon and Olivier Bousquet reminded us that we didn’t really care about optimizing to machine precision because generalization error was going to be 1/T at best. We knew this about generalization error long before COLT 2006.

But you might say, we had no algorithm for linear classification that could get a 1/T generalization error and fast convergence rates without parameters in 2007. This is also not the case.

The Russians proved in the 1960s that the Perceptron achieved the 1/T generalization rate. And the proof is basically a two-line leave-one-out argument. I learned this from Sasha Rakhlin (http://www.mit.edu/~9.520/fall18/slides/Class14_SL.pdf).
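
For context, the algorithm in question is just the classic mistake-driven Perceptron; a minimal sketch (my own, assuming labels in {-1, +1}):

```python
import numpy as np

def perceptron(X, y, epochs=1):
    # Classic mistake-driven Perceptron: update only when an example
    # is misclassified. Roughly, the leave-one-out argument bounds the
    # expected error by the mistake bound divided by the sample size,
    # which is where the 1/T rate comes from on separable data.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # mistake
                w = w + y_i * x_i
    return w
```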

OK, you might now point out that this assumes separable data. But if your data isn’t separable, you can only get a 1/sqrt(T) generalization rate anyway. A 1/T algorithm would be overkill in this case, and you might as well use a 1/sqrt(T) algorithm.

I'm just saying that all of the pieces were there already in 2007. We weirdly love to reinvent the wheel because ML research has worse short-term memory than goldfish. (See, for instance, the Best Paper Award at ICML 2023).


I was actually trying to address only the misunderstanding about Pegasos, nothing more... It's like a hobby for me: trying to teach the history of ML on the internet ;)

Regarding these new points: faster rates are possible on SVM objectives, and people did know about it even in those years. But a linear rate does not mean anything without considering the complexity per iteration. So, given that we knew that optimizing the training error to death is useless, and given the larger and larger datasets (I am sure you remember the "big data" period), Pegasos was still the right choice.

The Perceptron example is cute -- I also teach it in my Intro to ML class -- but it did not really work as well as an SVM. I am pretty sure of this one: I wrote a paper on multi-kernel learning where you start with a Perceptron-like update and then you switch to an SVM (https://ieeexplore.ieee.org/abstract/document/5540137), and you do gain in performance when you switch.

Let me actually be more precise: the Perceptron did work in some fields, like NLP, where the structured Perceptron was used, but not everywhere, for example not in computer vision.

So, Pegasos was still the best choice at that time, even over Perceptron.

Overall, I might disagree here that Pegasos is a reinvention of the wheel, sorry. ML does reinvent the wheel all the time, no doubt about it, but not in the Pegasos case!

But on the actual point of your post, I think we agree that faster convergence on the training error is useless in general and deterministic algorithms are more than useless in ML. But I think you could still attend a convex opt class that focuses on stochastic convex optimization theory. Less Lagrangian, more first-order algorithms. It is not so crazy; Beck has an entire book on it, no? Why do I think it is useful? Because we are finally in the asymptotic regime that Leon and Olivier were telling us about in 2007. Essentially, their paper was heavily cited without being really relevant, till now.

Pegasos does not apply anymore (because the L2 regularizer is too small), and I believe that we can design something better than Polyak averaging in this regime. Let's be optimistic: we might even be able to beat the giant genetic algorithm that produces clones of Adam.

On a more general note, I believe the real reason why optimization is becoming less and less relevant in ML is that the community decided that it is easier to make the objective function more benign by changing the architecture (like with batchnorm and resnets) than by designing new optimization algorithms. Who knows, they might be right. But knowing optimization theory might still be useful for understanding what they are doing, instead of claiming that the "gradients flow through the skip connections".
