This is a live blog of Lecture 12 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
We now turn to the mystical part of the machine learning course. When does a piece of computer code that makes good predictions on a sample of collected data make good predictions on data we haven’t seen? This question asks us to be metaforecasters. To predict prediction errors. Answering the question requires a theory of how collected data relates to the data in the wilds. In machine learning, this metaprediction problem is called generalization theory. In the applied sciences, this is the central question of external validity.
The most pervasive model relating the lab to the field assumes (a) data is randomly generated by Omniscient Jones rolling the same many-sided die to select each sample and (b) this data-generating process never changes. We call this the iid model,1 and it’s “useful” insofar as it gives us some rules of thumb to think about our methods. I put the scare quotes around “useful” because, like so many things in statistical theory, “useful” often just means “used a lot.”
People love the iid model because it lets them apply the weak law of large numbers with abandon. In the iid model, averages of samples are reasonable estimates of averages of the entire population you didn’t see. Sample means converge to population means. Moreover, the variance of your estimate scales inversely with n. The iid model justifies applying actuarial methods.
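To see the 1/n claim in a few lines, here is a minimal simulation sketch. The setup is hypothetical: a fixed predictor whose population error rate I set to 0.3, evaluated on repeated iid samples.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # hypothetical population error rate of some fixed predictor

for n in [100, 1_000, 10_000]:
    # Draw 500 independent evaluation samples of size n; each sample mean is
    # one estimate of the population error rate p.
    estimates = rng.binomial(n, p, size=500) / n
    print(f"n={n:6d}  mean estimate={estimates.mean():.4f}  "
          f"variance={estimates.var():.2e}  predicted p(1-p)/n={p * (1 - p) / n:.2e}")
```

Each tenfold increase in n cuts the variance of the estimate by a factor of ten, which is all the iid model is promising here.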
Let’s apply the model to machine learning. Suppose I have a potential prediction function, and I evaluate its error on an iid sample of size n. I can say that the error out of sample should be no worse than the average in-sample error plus some correction that goes to zero at a rate of n^(-1/2). The theory of generalization is just this with a lot more math grinding to make us feel smarter than we are.
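For the record, the precise statement behind that sentence is nothing deeper than Hoeffding’s inequality for a single fixed prediction function with zero-one loss (the notation below is mine, not from the lecture notes):

```latex
% With probability at least 1 - \delta over the draw of an iid sample of
% size n,
\[
  R(f) \;\le\; \widehat{R}_n(f) + \sqrt{\frac{\log(1/\delta)}{2n}},
\]
% where R(f) is the population (out-of-sample) error of the fixed function f
% and \widehat{R}_n(f) is its average in-sample error.
```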
This isn’t to say the iid model is all bad. There are still places where you are intentionally generating the randomness to collect data. In such cases, deductions from iid modeling provide actionable guidance. If you want some reasonable rules for quality control or clinical trials, iid sampling gives you a way to reason about your rulemaking.
Unfortunately for machine learners, we get the data we get and use the same methods no matter what. You’re going to train that transformer whether or not your data is an iid sample. And you can never test whether your data was generated in the iid model, nor can you test if it will be generated in the iid model tomorrow.
The most gracious way I can describe iid analysis is that it provides a means to compare different methods. One of the first generalization theorems was proved by Tom Cover and Peter Hart. It stated that in the iid model, the nearest neighbors algorithm has an error no worse than twice that of the optimal population prediction function in the limit of infinite data. In modern terms, the nearest neighbors algorithm should achieve nearly zero errors on the venerable ImageNet classification problem if it has access to a sufficiently large set of images from the internet (twice a near-zero optimal error is still near zero). The problem with Cover and Hart’s theorem is that it doesn’t estimate how many images you need to get near zero error. Generalization theory aims to determine the amount of data required for a model to be sufficiently accurate for deployment. It is supposed to give us a sense of how big our data sets should be. And it’s supposed to tell us that some machine learning models are less sample hungry than others.
But it doesn’t. I don’t believe in the tea-leaf reading that scaling law proponents do, but the most salient lesson we should have learned from the past two decades of machine learning practice is not to underestimate the impact of collecting more data. We never know how big our data sets should be, and every time we arrogantly push scaling limits, we find new interesting phenomena. The only rule we’ve learned is to make the data set bigger.
There’s thus no point obsessing too much about generalization bounds. Proving Cover and Hart’s theorem isn’t too hard, though it does rely on some continuity properties of the distribution. But in class, I’ll just prove a simple bound based on Hoeffding’s inequality and the union bound that shows that even if you evaluate many functions and pick the one with the best in-sample error, you’ll still have good out-of-sample predictions in the iid model.
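As a preview, here is a toy simulation of that selection step. It is my own illustration with the textbook constants, and for simplicity each candidate function is scored on its own iid evaluation sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, delta = 2_000, 500, 0.05   # sample size, number of candidates, failure probability

# Each candidate function's true (population) error rate, chosen arbitrarily.
true_errors = rng.uniform(0.2, 0.5, size=K)
# In-sample error of each candidate on its own iid evaluation sample of size n.
in_sample = rng.binomial(n, true_errors) / n

best = np.argmin(in_sample)                       # pick the best-looking candidate
bound = np.sqrt(np.log(2 * K / delta) / (2 * n))  # Hoeffding + union bound correction

print(f"picked candidate {best}: in-sample error {in_sample[best]:.3f}, "
      f"population error {true_errors[best]:.3f}")
print(f"correction term: {bound:.3f}, actual gap: {true_errors[best] - in_sample[best]:.3f}")
```

Even after cherry-picking the best of 500 candidates, the gap between in-sample and population error stays below the sqrt(log(2K/δ)/(2n)) correction with probability at least 1 − δ.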
That will be all I have to say and want to say about generalization theory. A decade ago, I coauthored a paper asking us to rethink our theories of generalization. Today, I’m fine putting the theory back on the shelf. All I’ve learned is that no matter how much math we use, generalization theory is not predictive of what happens in practice. Despite 75 years of trying to come up with a metapredictive framework to estimate the amount of data needed to make predictions about the future, the answer is always “more.”
Making matters worse, most of the metapredictions of iid theory are refuted experimentally when we can’t control the sampling process. It’s very easy to fool yourself into thinking that stuff like overfitting and p-hacking is a problem if you cook up iid models in R or Python and run simulations. But if you don’t see the same things happening on real data, it means your simulations are not an accurate reflection of reality. It means your model isn’t useful! We’ll talk about this more next week.
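In the meantime, to be concrete about the sort of simulation I mean, here is a pure-noise p-hacking sketch (everything is synthetic by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n_test, n_tries = 200, 10_000   # size of the "test set", number of models tried

y_test = rng.integers(0, 2, size=n_test)               # labels are pure coin flips
guesses = rng.integers(0, 2, size=(n_tries, n_test))   # each "model" guesses at random
accuracies = (guesses == y_test).mean(axis=1)

# Typically prints something well above 0.5, even though no rule can beat
# 0.5 on genuinely new data.
print(f"best 'test accuracy' after {n_tries} tries: {accuracies.max():.3f}")
```

In the simulation, this looks like an alarming failure mode; the claim above is that real benchmark data doesn’t reliably behave this way.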
1. If you care, iid is an abbreviation of independent and identically distributed.
Hi, very interesting blog.
1. There is this term "memorization" that seems to be thrown around a lot. What is the difference between memorization and generalization?
2. Is there any theory of what happens when the i.i.d. assumption is broken? For example, something that mathematically quantifies the "brokenness" of the i.i.d. assumption and then provides a guarantee on the sample complexity?
3. How does over-parameterization and over-fitting relate to generalization theory?
4. Are there any instances (even "hypothetical") where "more data" is not good? Perhaps if you've built your model under the i.i.d. assumption and this assumption does not hold in practice, so the model just "collapses"?
5. Also, how does feature representation affect the generalization bounds?
What are your thoughts on modern theory like Arora and Goyal's "A Theory for Emergence of Complex Skills in Language Models" (https://arxiv.org/abs/2307.15936)? Should we see more work like this? Less?
Generalization research in AI seems to be in the "extraordinary science" phase, to use the words of Kuhn. Theoretical progress from here will require recourse to philosophy and first principles, as well as cross-disciplinary interactions with psychology and cognitive science.