13 Comments
Uday Singh Saini

https://arxiv.org/abs/2507.07101? This is a recent work showing SGD isn't so bad.

Matthew

If ADAM is a bad idea, does that mean Newton's method is a bad idea, too? Assuming I understand correctly, ADAM is the first-order shadow of Newton's method arising from a diagonal approximation to the Hessian. I've always viewed Newton's method as the gold standard and SGD as the necessary compromise because Hessians are too big for models with many, many parameters.

Ben Recht

ADAM is neither a Newton method nor a quasi-Newton method; it's an entirely different beast. You can do stochastic Newton and quasi-Newton methods, and they definitely work well in many machine learning settings. But the adaptive scaling ADAM uses is its own (not particularly well motivated) construction.
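To make the distinction concrete, here is a minimal toy sketch (my own example, not from the thread) contrasting the two update rules on an ill-conditioned quadratic: a diagonal Newton step divides each coordinate's gradient by its true curvature, while ADAM divides by a running RMS of past gradients, which discards the curvature information.

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * sum(H_i * w_i^2) with known diagonal Hessian H.
H = np.array([100.0, 1.0])        # ill-conditioned diagonal curvature
w_adam = np.array([1.0, 1.0])
w_newton = np.array([1.0, 1.0])

# ADAM state with standard default hyperparameters.
m = np.zeros(2)
v = np.zeros(2)
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 101):
    # ADAM: scale the step by a bias-corrected running RMS of gradients.
    g = H * w_adam
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    mhat = m / (1 - beta1**t)
    vhat = v / (1 - beta2**t)
    w_adam = w_adam - lr * mhat / (np.sqrt(vhat) + eps)

    # Diagonal Newton: scale each coordinate by its true curvature.
    g = H * w_newton
    w_newton = w_newton - g / H

print("ADAM iterate:  ", w_adam)
print("Newton iterate:", w_newton)
```

On this quadratic the diagonal Newton step lands on the minimizer in one iteration, while ADAM's gradient-RMS scaling normalizes away the gradient magnitude and takes roughly fixed-size steps per coordinate, which is a different mechanism from approximating the Hessian.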

Carl Boettiger

Why minimize *average* prediction error, though? Surely ML still cares about strictly proper scores (à la Gneiting & Raftery, https://doi.org/10.1198/016214506000001437). (Maybe I'm just reading this too literally, and it just means minimizing some strictly proper score in the probabilistic case, right?)

Same premise: we're always minimizing something; it's just not always obvious what the 'right' choice is.

Ben Recht

I don't think I fully understand what you mean in the first part of your comment. When you minimize a proper score, you usually are minimizing an average, no?
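A tiny illustration of that point (a made-up Bernoulli example, just to fix ideas): minimizing a proper score over a dataset literally means minimizing its empirical average, and for the log score the minimizer comes out to the empirical frequency.

```python
import math

# Minimizing a proper score in practice means minimizing its empirical average.
# Toy example: fit a single Bernoulli probability p by average log score (log loss).
outcomes = [1, 0, 1, 1, 0, 1]

def avg_log_score(p):
    """Average negative log-likelihood of the outcomes under probability p."""
    return sum(-math.log(p if y else 1 - p) for y in outcomes) / len(outcomes)

# Grid search over p; the minimizer of the average log score is the
# empirical frequency of the outcomes (here 4/6).
best = min((avg_log_score(p / 1000), p / 1000) for p in range(1, 1000))
print(f"best p = {best[1]:.3f}, avg log score = {best[0]:.4f}")
```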

But I agree that what is "right" is always vague. Which is why I spend so much time at the beginning of the class exploring what it means to minimize averages in the first place.

Carl Boettiger

Right, you would minimize the proper score in expectation; just like you say, it's an average.

(It's just that many people read 'minimize prediction error' to mean RMSE, or assume that the prediction is a point prediction and not a distribution. Of course, even among strictly proper scores there's still only a vague sense of "right": e.g., CRPS and the log skill score both meet the 'strictly proper' criterion but have very different opinions about how bad it is if you say a particular outcome has probability 0 and it sometimes occurs.)
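For the binary case, here's a quick numerical sketch of that difference (using the Brier score as a bounded stand-in for CRPS, to which it corresponds for binary outcomes): the log score blows up as the predicted probability of an observed outcome goes to 0, while the Brier score stays bounded by 1.

```python
import math

# Predicted probability p for an outcome that actually occurs.
for p in (0.1, 0.01, 1e-6):
    log_score = -math.log(p)   # unbounded penalty as p -> 0
    brier = (1 - p) ** 2       # bounded by 1 even at p = 0
    print(f"p = {p:g}: log score = {log_score:.2f}, Brier = {brier:.4f}")
```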

Nico Formanek

A quick question about the continued use of ADAM. If I see it correctly, the mistake in the paper was pointed out around 2017, two years after its initial publication. Those publications not only point out the error but also propose fixes (with claimed empirical speed-ups). Is nobody using those? And if not, why?

hiranmay

Thank you for your extremely regular posts; it's been a treat reading all the notes of this new course of yours.

> Let me end in a hopeful state

Agree that we are at a "local optimum" of optimisers and more work can be done.

Perhaps that requires a frame shift: from thinking of optimisers as catalysts (the reaction/process is going to happen anyway; you just need to speed it up, as measured in wall-clock time or compute or whatever) to thinking of them as limiting reagents (the choice of optimiser non-trivially affects and limits the process, and thus the quality of the resulting models).

I liked this paper, which is in a very similar vein: "Optimizers Qualitatively Alter Solutions And We Should Leverage This" https://arxiv.org/abs/2507.12224.

João G. M. Araújo

What's GFYPO?

Alex Tolley

I don't wish to derail the thread, but I do want to say that iterative methods like gradient descent are computationally very expensive. One gets pretty good predictions using simpler, one-shot methods such as decision trees and their more complex cousin, random forests. These are far less CPU-intensive and produce a decent result in a fraction of the time of ANNs.

While wetware brains might superficially look like ANNs, they can learn to discriminate between different simple patterns with just a single presentation during learning, and then classify new, similar patterns based on the learned ones. This harks back to old bidirectional associative memories, which also do not require iterative approaches.

While the power of ANNs can be very impressive, we know that they do not work like wetware minds. Perhaps we are going down a path with a dead end in terms of building machines that are hugely resource-intensive to do tasks that we do with far less effort, albeit with less accuracy.

Hugo

"And though it’s sad for me to admit, there’s a lot of mileage you can get out of having a single optimization package." I wonder where the sadness comes from.

Ben Recht

Because optimization methods are cool and interesting! And in other contexts, you can get a lot of mileage out of a diverse toolbox.

Hugo

Cool. Your actual comment and my expectation of your comment converged!
