This is a live blog of Lecture 10 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
It’s undeniable that everything in machine learning is an optimization problem. The fundamental problem of machine learning is minimizing average prediction error on data we haven’t seen yet. In practice, we more or less do this by minimizing average prediction error on the data we’ve collected so far. Given optimization’s fundamental position, how much numerical optimization should we learn in a machine learning class?
I’m not sure I like my answer, but I’ve converged on a single-week dive into stochastic gradient methods. The vast majority of machine learning problems are now solved using some variant of such methods. And they are the common core connecting the first machine learning algorithm, the perceptron, to modern algorithms like the default-solver-of-all-problems ADAM and advanced reinforcement learning algorithms like GFYPO.
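To make that common core concrete, here’s a minimal sketch of the stochastic gradient template in plain NumPy, with the perceptron falling out as the special case where the loss is the perceptron loss and the step size is one. (This is just a toy illustration; the function names and the tiny dataset are made up for the example.)

```python
import numpy as np

def sgd(grad, w0, examples, lr=0.1, epochs=10, seed=0):
    """Generic stochastic gradient template: w <- w - lr * grad(w, x, y)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            x, y = examples[i]
            w = w - lr * grad(w, x, y)
    return w

# The perceptron is this same loop with step size 1 and the perceptron loss
# max(0, -y * <w, x>), whose subgradient is -y * x on mistakes and 0 otherwise.
def perceptron_grad(w, x, y):
    return -y * x if y * np.dot(w, x) <= 0 else np.zeros_like(w)

# Toy usage on two linearly separable points.
data = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, -1.5]), -1)]
w = sgd(perceptron_grad, np.zeros(2), data, lr=1.0, epochs=5)
```

ADAM dresses up the step with momentum and per-coordinate scaling, and the reinforcement learning methods change how the stochastic gradient estimate is produced, but the loop is the same.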
This pedagogical choice means I will sadly neglect some useful and interesting methods. I’d like to talk more about least-squares solvers. I’d like to talk more about quadratic programs and their rich history in pattern classification and learning. I’d like to talk about boosting, which is an undeniably useful application of the coordinate descent method. Frankly, I wouldn’t even mind talking about expectation maximization, which was a widely applied heuristic solver in machine learning for decades. Those methods will have to wait for when I reboot my optimization class.1
Let me justify my decision to be narrow. First, stochastic gradient methods have accessible and informative theoretical foundations. The theory is far more prescriptive than what we saw in the last two lectures about features and function approximation theory. Approximation theory gives us some vibes about how to pick activations, but it doesn’t really tell us what to do. Optimization theory, on the other hand, provides us with immediately actionable insights. The theory of stochastic gradients is mature enough now to tell you what to try when your weights and biases curve stops going down. It’s also helpful to understand this theory even if you’re not doing machine learning, as you can apply these ideas in other contexts.
And though it’s sad for me to admit, there’s a lot of mileage you can get out of having a single optimization package. Optimization researchers love to present a vast toolkit, but most people don’t want to have to think about how to initialize Gauss-Newton methods or what have you. They want a default optimizer that works out of the box. This is why people have convinced themselves that the ADAM optimizer [insert obligatory citation to buggy Kingma Ba paper here] works with the default parameters. It’s one less thing to think about when trying to tweet your way to a new conference paper or product release.
But it’s helpful to understand the reasons why stochastic gradient methods are the default methods. As I mentioned, understanding some of the theory of stochastic gradients will also let us get a handle on adaptive experiments and reinforcement learning practices later in the semester.
Unfortunately, understanding the actual algorithms that everyone uses today is much harder. It’s pretty weird that, a decade in, we still don’t have a slam-dunk case for why ADAM is a good optimizer. The ADAM paper is the most cited optimization paper of all time, by a large margin. ADAM is the backbone of all LLM optimization. It’s the default optimizer everyone uses. And optimization theory can’t explain why.
In fact, in theory, ADAM is a bad idea. It’s not hard to show that gradient descent can always match the in-sample performance of adaptive methods on linear prediction problems. Linear problems are the only ones we can analyze well, so I’m not sure why you’d hope for a better result in the nonlinear case. It’s easy to build simple examples where adaptive gradient methods have worse out-of-sample performance than non-adaptive ones.
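To see the mechanism such examples exploit, here’s a toy NumPy sketch, with made-up dimensions and step sizes rather than any particular construction from the literature. On an underdetermined least-squares problem, gradient descent started from zero keeps its iterates in the row space of the data and so converges to the minimum-norm interpolating solution. ADAM’s per-coordinate rescaling breaks that invariant, so it wanders out of the row space and ends up somewhere else. Which solution you end up with is the lever behind counterexamples in the style of “The Marginal Value of Adaptive Gradient Methods.”

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined least squares: more parameters than samples, so there are
# infinitely many interpolating solutions. f(w) = 0.5 * ||X w - y||^2.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad(w):
    return X.T @ (X @ w - y)

def run_gd(steps=5000, lr=1e-3):
    w = np.zeros(d)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(steps=5000, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
        w = w - lr * mhat / (np.sqrt(vhat) + eps)
    return w

# Gradient descent from zero stays in the row space of X, so it converges to
# the minimum-norm interpolating solution. ADAM's coordinate-wise scaling does
# not preserve that invariant.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print("GD   distance to min-norm solution:", np.linalg.norm(run_gd() - w_min_norm))
print("ADAM distance to min-norm solution:", np.linalg.norm(run_adam() - w_min_norm))
```

The sketch doesn’t prove anything about test error on its own; it only shows that the two methods solve the same in-sample problem and still land in different places, which is exactly the opening those counterexamples use.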
But whatever, I’m willing to totally accept that ADAM with default parameters works better on everyone’s deep learning problems. I can actually explain this with a theory of optimization. Machine learning research is a giant, parallel genetic algorithm, fueled by an infinite pool of funding from megacompanies and venture capital. When you have infinite money and time pressure, the room for innovation is pretty small. So we race, frenzied, to the next deadline, building piles upon piles of technical debt.
It’s quite possible that no variant of stochastic gradient descent with momentum is going to outperform ADAM on the current architectures that have solely been trained and fine-tuned with ADAM. Given all of the tricks with bit precision, attention caching, and fine-tuning to hardware idiosyncrasies, our optimizers might be at a local optimum. This doesn’t mean that there isn’t a thousand-fold more efficient way to build language models sitting out there. But if you can burn a dollar and earn two back in venture capital, why think about being efficient?
Let me end on a hopeful note. If you have a little bit of time and theory on your side, I’m sure you can find more efficient means to build these giant models. Any time I get a chance to think about a specific machine learning optimization problem, I always find ways to clean things up. In our current frenzy, we’re not allowed such luxury of thought. But that time might come once the money runs out.
Spring 2027, perhaps? Though reliable sources at Anthropic tell me there will be no graduate school in 2027. Oh well.
If ADAM is a bad idea, does that mean Newton's method is a bad idea, too? Assuming I understand correctly, ADAM is the first-order shadow of Newton's method arising from a diagonal approximation to the Hessian. I've always viewed Newton's method as the gold standard and SGD as the necessary compromise because Hessians are too big for models with many, many parameters.
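For reference, here are the updates being compared, with ADAM’s momentum and bias correction stripped out so the analogy is easiest to see (a simplification, not the full algorithm):

$$
\begin{aligned}
\text{Newton:}\quad & w_{t+1} = w_t - \big(\nabla^2 f(w_t)\big)^{-1}\,\nabla f(w_t)\\
\text{Diagonal Newton:}\quad & w_{t+1} = w_t - \mathrm{diag}\big(\nabla^2 f(w_t)\big)^{-1}\,\nabla f(w_t)\\
\text{ADAM (simplified):}\quad & w_{t+1} = w_t - \frac{\alpha\, g_t}{\sqrt{v_t} + \epsilon},
\qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\end{aligned}
$$

The catch is that $v_t$ is a running average of squared gradients, not the Hessian diagonal, and the square root makes the step scale like $1/|g_t|$ rather than like an inverse curvature. So the analogy is loose: ADAM looks more like a normalized first-order method than like Newton’s method with a diagonal Hessian, which is one reason the Newton analogy only takes you so far.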