This is a live blog of Lecture 4 (part 1 of 2) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
Optimization has long been a cornerstone of pattern recognition. Bill Highleyman, the unsung machine learning pioneer, proposed gradient descent to minimize errors on a training set in the late 1950s. In the 1960s, researchers proposed dozens of methods to minimize empirical risk. But even back then, no one could agree on the right problem to solve. Look at this table from Duda and Hart (1973):
This table provides ten different methods for fitting a linear function to data. Here, the weights of the linear rule are the vector a, b_i denotes the label of example i, and y_i is the vector of features for example i. The notation (x)_+ stands for the maximum of x and 0.
What are these methods in modern language? The first five would be considered variants of the Stochastic Gradient Descent method. Fixed Increment is the Perceptron algorithm. The Variable Increment method is SGD applied to the hinge loss. Widrow-Hoff is SGD on a least-squares loss. Stochastic Approximation is SGD when you assume you have a data-generating distribution rather than a finite data set. In 2023, everyone in ML theory knows these algorithms, but there was a period between 2007 and 2015 when each of them was reinvented. For example, Pegasos, a 2007 ICML paper that was an honorable mention for the 2017 test-of-time award, is the same as the Variable Increment method. None of the papers in the 2007-2015 window cited the work from the 1960s.
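To make the dictionary concrete, here is a minimal sketch (mine, not Duda and Hart’s) of how those first few rows are all the same stochastic gradient loop with different per-example losses. The toy data and function names are my own inventions; a is the weight vector, y_i the feature vector, and b_i the +1/-1 label, matching the table’s notation.

```python
import numpy as np

def sgd(Y, b, grad, lr=0.1, epochs=50):
    """Cycle through the examples, taking one gradient step per example."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for y_i, b_i in zip(Y, b):
            a -= lr * grad(a, y_i, b_i)
    return a

def perceptron_grad(a, y, b):
    # Fixed Increment / Perceptron: step only on misclassified examples.
    return -b * y if b * (a @ y) <= 0 else 0 * y

def hinge_grad(a, y, b):
    # Variable Increment / Pegasos-style: subgradient of the hinge loss max(0, 1 - b a'y).
    return -b * y if b * (a @ y) < 1 else 0 * y

def lms_grad(a, y, b):
    # Widrow-Hoff / LMS: gradient of the squared loss (a'y - b)^2 / 2.
    return (a @ y - b) * y

# Toy linearly separable data: features Y (with a bias column), +1/-1 labels b.
rng = np.random.default_rng(0)
Y = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
b = np.sign(Y[:, 0] + 0.5 * Y[:, 1] + 0.1)

for name, grad in [("perceptron", perceptron_grad), ("hinge", hinge_grad), ("least squares", lms_grad)]:
    a_hat = sgd(Y, b, grad)
    print(name, "training error:", np.mean(np.sign(Y @ a_hat) != b))
```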
The next entry in the table is my favorite classification method, the pseudoinverse. This simply solves a least-squares problem to find a classifier. And the last entry is also interesting: Mangasarian showed in 1965 that you could minimize the hinge loss using linear programming. The first linear programming method minimizes the worst-case hinge loss; the second minimizes the average loss. Minimizing the average loss with linear programming would be called the Support Vector Machine in the 1990s. You might argue that the SVM is different because it adds the squared norm of the weights to the objective function. But this small modification just allows for an extra researcher degree of freedom. It doesn’t provide profound gains in theory or practice, and so perhaps it might not deserve the hype that accompanied the SVM. Even when I met him in 2008, Mangasarian was still annoyed that his work had been ignored during the SVM boom of the 1990s.
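And here is a rough sketch of those two ideas on the same made-up toy data as above: the pseudoinverse method is just least squares on the +1/-1 labels, and the average-hinge-loss formulation is an ordinary linear program (I’m using scipy’s linprog as a stand-in for whatever solver Mangasarian had in 1965; this is not his code).

```python
import numpy as np
from scipy.optimize import linprog

# Same toy data as the previous sketch.
rng = np.random.default_rng(0)
Y = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
b = np.sign(Y[:, 0] + 0.5 * Y[:, 1] + 0.1)
n, d = Y.shape

# Pseudoinverse: a = pinv(Y) b minimizes the squared error ||Y a - b||^2.
a_pinv = np.linalg.pinv(Y) @ b

# Average hinge loss as a linear program:
#   minimize (1/n) sum_i xi_i  subject to  b_i (a'y_i) >= 1 - xi_i,  xi_i >= 0.
# Decision variables are stacked as [a; xi].
c = np.concatenate([np.zeros(d), np.ones(n) / n])      # objective: mean slack
A_ub = np.hstack([-(b[:, None] * Y), -np.eye(n)])      # -b_i y_i'a - xi_i <= -1
b_ub = -np.ones(n)
bounds = [(None, None)] * d + [(0, None)] * n          # a free, slacks nonnegative
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
a_lp = res.x[:d]

for name, a_hat in [("pseudoinverse", a_pinv), ("hinge LP", a_lp)]:
    print(name, "training error:", np.mean(np.sign(Y @ a_hat) != b))
```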
I only bring up our selective short-term memory here to point out that we have been obsessed with optimization in ML since the beginning, but we keep forgetting the past because no one can ever agree on the right problem formulation. All of these algorithms more or less do the same thing, and yet people write gazillions of papers on their differences. Duda and Hart noticed this too. First, the remarks column in this table is hilarious, as it shows that even they were unsure when to recommend one of these methods versus the other. But their chapter notes begin with this banger quote:
“Because linear discriminant functions are so amenable to analysis, far more papers have been written about them than the subject deserves.”
Some things never change!
Much to Duda and Hart’s chagrin, I’m sure, the most cited paper in all of optimization would end up being a 2014 paper that proves an incorrect theorem with an unparsable convergence bound for linear discriminant functions.
Why is optimization in machine learning so weird? Why do we keep reinventing variants of the same algorithm and trying to prove that these algorithms are better than each other with abstract nonsense?
Optimization is ideal when you really want to know what minimizes some cost subject to some constraints. There’s a promise that as long as you state the problem cleanly, the solver doesn’t matter: you get an answer, and any correct answer is fine. People can write custom solvers to speed up certain problems, but at the end of the day, if the solver works correctly, the problem poser doesn’t care how you got the solution.
As a simple example, when you ask for the shortest path from A to B, you are usually actually interested in the path that is the shortest. And optimization algorithms will happily oblige.
In machine learning, by contrast, optimization will give you a function that minimizes the average error on the training data. But is that what you want? Not really! You want the function that minimizes the error on data you haven’t seen. But if we haven’t seen the data, how do we know we did a good job? We rely on benchmarks and hope for the best.
Let’s say you run the empirical risk minimization algorithm with two different solvers. Suppose Solver 2 has a bug and returns the wrong answer. You evaluate the prediction functions from Solver 1 and Solver 2 on your testing data and find Solver 2 has lower error on the test data. Which one do you submit to Kaggle? Don’t lie. We all know the answer is the one from Solver 2. In fact, we’d probably look at the code for Solver 2, find the bug, and then write a paper for ICLR about how this is not a bug but an amazing new regularization method.
Even this terrible example is too kind to what we really do with optimization in machine learning. We’ll run dozens of possible solvers with many different feature representations. We’ll take the solution of the SVM and the solution of the Perceptron. We’ll take whatever XGBoost spits out. Maybe we’ll try a transformer too. People tell me transformers are awesome. We could even build an ensemble model out of all of the above. Anything that could possibly be fit to the training set is evaluated on the test set. And whatever gets the best error on the test set is what we write our next paper about.
Our machine learning algorithm development is what Stephen Boyd calls “graduate student descent.” Given the industrial interest, I think these days it’s better designated “GitHub descent.” Find a model on the internet, tweak a parameter or two, see if it gets better test error. If it does, that’s a paper. We’re most definitely optimizing, as we really care about these competitions on dataset benchmarks. But our algorithm is some sort of massively parallel genetic algorithm, not a clean, rigorous, and beautiful convex optimization method.
This is why I have a hard time spending too much time on optimization in machine learning courses. We don’t need to know what the Lagrange multipliers in the Support Vector Machine mean to partake in the collective genetic algorithm. It seems that having candidate models well-fit to training data is important. Beyond that, how we choose models is up to the vagaries of the benchmarks before us.
I wish it were otherwise, but I don’t know if having “principled methods” helps us in practice at all.
"We’re most definitely optimizing, as we really care about these competitions on dataset benchmarks. But our algorithm is some sort of massively parallel genetic algorithm, not a clean, rigorous, and beautiful convex optimization method."
See? This illustrates my point about Darwin being the synthesis of Hume (messy, collective lurching about in the space of algorithms, objectives, etc.) and Kant (individual engineers finding a model on the internet and tweaking a parameter or two).
What's maybe ironic about that list of methods from Duda and Hart is that, although it's a long list, it doesn't appear to include a standard statistics-style logistic regression fitting procedure (i.e. maximum-likelihood a.k.a. cross entropy, I believe typically done in batch mode). So you could easily add that to the list.