This is a live blog that bridges Lectures 6 and 7 of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
No one defended Generalization Theory in the comments yesterday. Well, shoot, so now what? Do we call off the COLT conference? Maybe instead, let me try to mount my own defense of theory.
Inspired by some questions from Lily-Belle Sweet and a heated argument with Ludwig Schmidt, here is my best attempt at a steelman argument for what machine learning theory can tell us about machine learning practice.
1. Rather than assuming you sample data at random, treat the data set as the superpopulation.
2. A split of this data into a train set and a test set is an iid sample from this superpopulation.
3. Once we have done this split, statistics says we can evaluate many models on the test set.
4. Which models do we evaluate on the test set? Well, anything goes here.
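In code, the recipe looks something like the following minimal sketch (my own toy illustration on scikit-learn’s small digits data set, not anything prescribed by the steps above): fix the data, carve out a test split once, then evaluate as many models as you like against it.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Step 1: the fixed data set plays the role of the superpopulation.
X, y = load_digits(return_X_y=True)

# Step 2: one iid split into a train set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 3 and 4: evaluate whatever models we like on the held-out test set.
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "3-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "RBF-kernel SVM": SVC(kernel="rbf", C=10, gamma="scale"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```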
This is more or less what we do in machine learning. And it gives some sense of internal validity to our practice.
On the one hand, I’d urge us not to think this is the only obvious way to systematize pattern recognition. Researchers didn’t coalesce around this holdout method until the 1960s. Other proposals came first: Perceptrons, risk minimization, and their nonlinear variants were developed in the 1940s and 1950s. Running Kaggle-style competitions on fixed data sets quickly became best practice in the 1960s, but this was also not obvious. Highleyman only formalized the notion of train-test splits in 1962, and his analysis suggested the need for tons of data. He needed to be corrected by the Russian camp and their large deviation theory. Our current machine learning practice is only obvious in hindsight. And we’ve been trying to make better sense of it as our mathematical tools have improved in the interim.
But it’s still not great, folks. I can already nitpick Step 2, because we build benchmarks from balanced samples drawn without replacement, not iid draws. But Step 3 seems to rest on a lot of post hoc arguments because we don’t see overfitting. All of the probabilistic bounds would suggest that any model that gets under 2% error on MNIST is as good as any other. We know this is not the case. It’s only worse for more complex benchmarks. To make matters even worse, there are arguments from theoretical computer science showing that nefarious things can happen when you adaptively choose which functions you evaluate based on the ones you’ve already studied. I don’t think theory gives us much insight into why we can evaluate trillions of ideas on benchmarks and have the models work well on other problems.
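To put a number on the MNIST claim, here’s a back-of-the-envelope Hoeffding calculation (my own illustration, not from the lecture) for a 10,000-image test set, with a union bound over however many models get evaluated:

```python
import math

# Hoeffding plus a union bound: with n iid test points, confidence 1 - delta,
# and k models evaluated on the same test set,
# |test error - population error| <= sqrt(log(2k / delta) / (2n)) for all k models.
def hoeffding_radius(n, k=1, delta=0.05):
    return math.sqrt(math.log(2 * k / delta) / (2 * n))

n = 10_000  # size of the MNIST test set
for k in [1, 100, 1_000_000]:
    r = hoeffding_radius(n, k)
    print(f"{k:>7} models: test error pinned down to +/- {100 * r:.1f} percentage points")

# Roughly +/- 1.4 points for one model and only about +/- 3 points after a
# million models. By these bounds alone, everything under ~2% MNIST error is
# statistically indistinguishable, and adaptively chosen models aren't covered at all.
```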
But does it even matter? What’s happening in Step 4? This seems to be where all the action is. If I wanted to use “theory” to tell you what to do in Step 4, I’d try this: most generalization bounds are just saying that models with large margin generalize. So try to find models that capture some notion of large margin on the training set, because we expect these will do well on the test set. That is, pick models that interpolate or nearly interpolate the training data. Save yourself computation time and restrict your attention to models that train quickly on the hardware you can access. And use structures that match your beliefs about what a good prediction function should look like.
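Here’s what that advice amounts to in code. A minimal sketch on made-up two-dimensional data (my own toy example; every choice below is an assumption, not a recommendation from the lecture): turn the regularization way down on a kernel classifier so it nearly interpolates the training set with large margin, and check the held-out error.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data standing in for "the superpopulation."
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Large C means weak regularization: the RBF-kernel SVM is pushed toward a
# large-margin, (near-)interpolating fit of the training data.
model = SVC(kernel="rbf", C=1e4, gamma=2.0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # close to 1: near interpolation
print("test accuracy: ", model.score(X_test, y_test))    # held-out accuracy stays high
```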
That’s some weak sauce! In particular, that last piece of advice is some vague nonsense. What structures match my beliefs about what a good prediction function should look like? Using our machine learning jargon, what makes a good feature?
We’ve conveniently abstracted everything in machine learning to iteratively pushing vectors around. But which vectors should we use? How do we decide how to take reality and turn it into “data”? What do we put in our Excel spreadsheets?
Harkening back again to those halcyon days of Bell Labs in the 1950s, I don’t think we appreciate enough that we represent digits as digital images, a box of numbers. The introductory paragraphs of this paper by Highleyman and Kaminsky suggest a Wild West in 1950s character recognition. There were varied proposals for optically capturing characters for recognition. Highleyman and Kaminsky proposed a general-purpose machine that would turn the characters into an intermediate form so that all methods could be conveniently compared before deciding on the best. This sounds stilted in 2023: they decided to turn the handwritten text into a digital bitmap and then build algorithms that took this bitmap and decided what character it represented.
Just like the train-test split, it seems quite obvious in hindsight that OCR requires an image first. But hindsight bias is a powerful drug. We underappreciate how powerful pixels are. With zero parameters, I can build a character recognition system that gets 1% error on MNIST and works even if the pixels in the images are given to me in a random order (so no spatial information about the pixels is used). I haven’t run the experiment myself, but I don’t think this is something a human can do.
What digit is this?
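My guess is that the zero-parameter system in question is something like nearest neighbors on raw pixels (that’s my assumption; the post doesn’t say). There is nothing to fit, and since Euclidean distance doesn’t care about the order of coordinates, scrambling the pixels changes nothing. A bare-bones sketch, which lands at a few percent error rather than 1%:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier

# MNIST as flat 784-dimensional pixel vectors, standard 60k/10k split.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

# Apply the same random permutation to every image's pixels, destroying all
# spatial structure. Distances between vectors are unchanged, so a
# nearest-neighbor classifier is completely indifferent to the scrambling.
perm = np.random.default_rng(0).permutation(784)
X_train, X_test = X_train[:, perm], X_test[:, perm]

# "Zero parameters": the classifier just memorizes the training images and
# predicts the label of the closest stored ones.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test error:", 1 - knn.score(X_test, y_test))
```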
These pixels capture so much useful information that I can use Aizerman’s potential function method and get around 99% accurate digit recognition. How can that be possible?
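Aizerman’s potential function method is, in modern language, a kernel method. I haven’t verified the 99% number myself, but here is roughly the experiment it implies, with scikit-learn’s RBF-kernel SVM standing in for the 1964 machinery (my substitution; the hyperparameters are guesses):

```python
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

# Raw pixels in, an RBF kernel (a "potential function") on top, nothing else.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]
X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

# RBF-kernel SVMs on raw MNIST pixels are commonly reported in the 98-99%
# accuracy range; training on all 60k images takes a while.
clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```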
In today’s class, I want to think about what we need in order to capture datasets with recognizable patterns. What makes a good feature? Are there guiding principles at work here? Is the success of machine learning built on the foundations of signal processing? I again have lots of questions! I’ll report back tomorrow with what we decided.
Enjoying this series. Not sure I want to step into the fray on generalization because I don't think I know enough. Hasn't there been some theory that deals with the question of "match your beliefs about what a good prediction function should look like"? There's some work that's shown how the structure of networks (like convolutions) is good for image-like data. That seems like a good direction if you can make some assumptions about the right prediction function... although that question seems fairly unanswerable for lots of real-world data.
What I'm also uncertain of is whether current theory tells us enough about how the structure of the input data affects the problem. To take your example of images, pictures of things in the real world have a lot of nonlinear structure that makes them inherently lower dimensional than a vector of dimension # pixels. Can current theory explain the success of ML in terms of this? I am thinking of the "adaptivity to reduced support" for linear subspace dependence from https://francisbach.com/quest-for-adaptivity/, which says even kernel methods should be okay. Does this help "rescue" generalization theory a little bit?