I don't know what I've been trying to prove.
Yeah, yeah. Why I feel like I do about generalization theory.
This is a live blog of Lecture 6 (part 3 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
Since the generalization blogs this week have gotten little reaction on the interwebs, I was quite surprised by the excited conversations in class yesterday. Because I’m rather disappointed after twenty years of thinking about generalization theory, it’s hard to write the uniform convergence bound on the board with a straight face. But the class pressed me to explain in detail why I was so skeptical, and I think many made compelling arguments that I was being too harsh.
There is something wondrous about applying the Law of Large Numbers and concluding you can look at an infinite set of hypotheses and still have the empirical errors all be close to the errors on data you are yet to see. This sort of inductive power looks incredible, and that it follows from undergraduate probability is kind of amazing. I’ll admit it: my mind is still regularly blown by the concentration of measure phenomenon. But it is not clear that these probabilistic bounds help in machine learning because they assume an incorrect model of data generation.
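To make that concrete, here's the quantitative form of that Law of Large Numbers statement for a single, fixed predictor $f$ (a standard Hoeffding bound for the 0/1 loss; the notation is mine, not something from the lecture):

$$
\Pr\Bigl[\,\bigl|\hat{R}_n(f) - R(f)\bigr| \geq t\,\Bigr] \;\leq\; 2\exp\bigl(-2nt^2\bigr),
$$

where $\hat{R}_n(f)$ is the empirical error of $f$ on $n$ i.i.d. samples and $R(f)$ is its error on the population.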
After twenty years of being fascinated by the mathematics, I'm convinced generalization theory is almost entirely useless. I spent countless hours learning about covering numbers and norm inequalities in Banach spaces. But the guidance that comes out of this mathematics for machine learners just seems to be bad advice. And if your theory gives bad advice, perhaps we should drop the theory.
Let me dive into the first fallacy that bothers me. Much of generalization theory, including bounds based on uniform convergence and algorithmic stability, begins with the notion that empirical error (the error on the data I’ve seen) should be close to population error (the data I haven’t seen). The last decade of machine learning practice has emphatically demonstrated that this is not a desideratum.
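For the record, the simplest version of what "close" means here, for a finite hypothesis class $\mathcal{H}$, is just the Hoeffding bound above plus a union bound: with probability at least $1-\delta$,

$$
\sup_{f \in \mathcal{H}} \bigl|\hat{R}_n(f) - R(f)\bigr| \;\leq\; \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}},
$$

and the VC and covering-number machinery exists mainly to push the same statement to infinite classes.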
The best machine learning models have zero empirical error. Telling people that the empirical error is a good proxy for the population error is counterproductive advice. But if the empirical error is always zero, what do these generalization theory bounds tell us to do? They say you should pick the “smallest” function space that gets the empirical error to zero.
But this is also terrible, counterproductive advice for practice. All of these models get zero empirical error on the CIFAR10 benchmark:
A look-up table
A quadratic polynomial
A 2-layer fully connected neural net
An AlexNet model
A Wide ResNet
These function spaces are of increasing complexity. And all of them can fit any data pattern you’d like with 50,000 examples. But I have ranked them by their test error. The look-up table gets 90% test error. The quadratic fit gets 64% test error. The wide ResNet gets 2% test error. Why? Generalization theory has no answers.
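If you want to poke at this yourself, here's a minimal sketch of the comparison. It assumes CIFAR-10 is already loaded as flattened numpy arrays X_train, y_train, X_test, y_test, and uses scikit-learn stand-ins for two of the function classes above; those names and model choices are my illustrative placeholders, not the exact setup from the lecture.

```python
# Assumes CIFAR-10 is already loaded as flattened numpy arrays:
# X_train, y_train, X_test, y_test.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


def train_and_test_error(model, X_train, y_train, X_test, y_test):
    """Fit a model, then report its empirical (train) and test error."""
    model.fit(X_train, y_train)
    train_err = 1.0 - model.score(X_train, y_train)
    test_err = 1.0 - model.score(X_test, y_test)
    return train_err, test_err


models = {
    # 1-nearest-neighbor is a literal look-up table: zero empirical error
    # by construction, barring duplicate inputs with conflicting labels.
    "look-up table (1-NN)": KNeighborsClassifier(n_neighbors=1),
    # A wide 2-layer fully connected net; with enough hidden units it can
    # drive the training error on 50,000 examples to zero.
    "2-layer fully connected net": MLPClassifier(
        hidden_layer_sizes=(4096,), max_iter=500),
}

for name, model in models.items():
    train_err, test_err = train_and_test_error(
        model, X_train, y_train, X_test, y_test)
    print(f"{name}: train error {train_err:.2f}, test error {test_err:.2f}")
```

Both models fit the training set; only the test column separates them.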
OK, now you might point to the margin-esque generalization bounds like we proved using online-to-batch conversion. But even for linear problems, there is rarely a correlation between the post-hoc margin measured after training and the performance on a benchmark. The margin bounds are just upper bounds, and they don’t help us other than to say that the error should go to zero if we can get more data to train on.
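For reference, one representative bound of this flavor, stated up to constants, for linear predictors $x \mapsto \langle w, x\rangle$ with $\|w\|_2 \leq B$ and $\|x\|_2 \leq R$: for any margin $\gamma > 0$, with probability at least $1-\delta$,

$$
\Pr\bigl[\,y\langle w, x\rangle \leq 0\,\bigr] \;\lesssim\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl\{y_i\langle w, x_i\rangle \leq \gamma\bigr\} \;+\; \frac{RB}{\gamma\sqrt{n}} \;+\; \sqrt{\frac{\log(1/\delta)}{n}}.
$$

It's an upper bound and nothing more.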
This seems to be the only generally valid advice from generalization theory: “larger n should yield smaller population error.” I suppose this is more or less true. But how much complex geometric functional analysis do we need to prove that more data is more better?
The main goal of applied math is to guide practice. We want theories that, while not perfect, give reasonable guidelines. But the advice from generalization theory just seems bad. I swear that all of the following bullets were lessons from my first ML class 20 years ago and are still common in popular textbooks:
If you perfectly interpolate your training data, you won’t generalize.
High-capacity models don’t generalize.
You have to regularize to get good test error.
Some in-sample errors can reduce out-of-sample error.
Good prediction balances bias and variance.
You shouldn’t evaluate a holdout set too many times or you’ll overfit.
This is all terrible advice!
If our theory gives bad advice for practice, we have to come clean and admit our theory has failed.
Why machine learning theory failed is a fascinating story for another day. The issue with a lot of statistical theory is that we take metaphysical beliefs (e.g., data is i.i.d.), turn these beliefs into mathematical axioms, and then forget that these beliefs weren’t particularly grounded in the first place. What is the value of a bunch of theorems based on faulty axioms?
Rather than answering that question, let me end this blog with a proposed path out of this mess. What does “work” in practice? It’s hard to argue against this four-step procedure:
Collect as large a data set as you can
Split this data set into a training set and a test set
Find as many models as you can that interpolate the training set
Of all of these models, choose the model that minimizes the error on the test set
This method has been tried and true since 1962. You can say that step 4 is justified by the law of large numbers. Maybe that’s right. But there’s still a lot of magic happening in step 3.
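To spell out that law-of-large-numbers justification for step 4: if you compare $k$ models on a held-out test set of $m$ examples, Hoeffding plus a union bound says that with probability at least $1-\delta$, every test error is within

$$
\sqrt{\frac{\log(2k/\delta)}{2m}}
$$

of its population error, so the test-set minimizer is nearly the population minimizer.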
I restructured the machine learning theory class this semester to front-load the theory for linear functions. I think it’s important to at least see the standard theory so we can discuss why it tells us very little about practice. But the second quarter of this class will now try to figure out how ML practice works. We’ll turn to feature generation and try to understand how we decide on representations. From there, we’ll discuss what happens when we try to fit models by iteratively tweaking these features. And then we’ll return to the resilience of the train-test paradigm. Let us get swept into the depths.
On Twitter, Lily-Belle Sweet brought up a great question. How do we justify leave-one-out error or the holdout method or cross validation without an appeal to i.i.d.?
Here’s my take on this. When we collect a training data set, it serves as a population itself. If we subsample from this training data, the i.i.d. assumption holds by construction because we do the sampling ourselves. Hence, bootstrap methods are telling us something about internal validity: they tell us how the method performs when the superpopulation is our training set. Then, to generalize beyond this to new data, we just have to convince ourselves that the training set is a representative sample of the data we’ll see.
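Here's a minimal sketch of that view, assuming the training data lives in numpy arrays X_train, y_train and using logistic regression as a placeholder model (all of these choices are illustrative): we draw bootstrap resamples from the training set, so the i.i.d. assumption holds with the training set playing the role of the population, and we measure error on the out-of-bag points.

```python
# Assumes the training data lives in numpy arrays X_train, y_train.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = len(y_train)

internal_errors = []
for _ in range(20):
    # A bootstrap resample: n i.i.d. draws from the empirical distribution
    # of the training set. The i.i.d. assumption holds by construction
    # because we did the sampling ourselves.
    idx = rng.choice(n, size=n, replace=True)
    out_of_bag = np.setdiff1d(np.arange(n), idx)

    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    internal_errors.append(
        1.0 - model.score(X_train[out_of_bag], y_train[out_of_bag]))

# An estimate of internal validity: how the method performs when the
# "superpopulation" is the training set itself. Whether this transfers to
# new data depends on the training set being representative of that data.
print(f"mean out-of-bag error: {np.mean(internal_errors):.3f}")
```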
Thoughts?
"If our theory gives bad advice for practice, we have to come clean and admit our theory has failed."
I agree with this, of course. It is important to take the next step though and state that we need new theory, consistent with empirical evidence.
Btw, I also find uniform-type bounds quite beautiful. That is a big part of their appeal, I suppose.