This is a live blog of Lecture 12 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
We now turn to the mystical part of the machine learning course. When does a piece of computer code that makes good predictions on a sample of collected data make good predictions on data we haven’t seen? This question asks us to be metaforecasters. To predict prediction errors. Answering the question requires a theory of how collected data relates to the data in the wilds. In machine learning, this metaprediction problem is called generalization theory. In the applied sciences, this is the central question of external validity.
The most pervasive model relating the lab to the field assumes (a) data is randomly generated by Omniscient Jones rolling the same many-sided die to select each sample and (b) this data-generating process never changes. We call this the iid model,1 and it’s “useful” insofar as it gives us some rules of thumb to think about our methods. I put the scare quotes around “useful” because, like so many things in statistical theory, “useful” often just means “used a lot.”
People love the iid model because it lets them apply the weak law of large numbers with abandon. In the iid model, averages of samples are reasonable estimates of averages of the entire population you didn’t see. Sample means converge to population means. Moreover, the variance of your estimate scales inversely with n. The iid model justifies applying actuarial methods.
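To see the 1/n claim in a few lines, here is a minimal simulation sketch. The setup is hypothetical: a fixed predictor whose population error rate I set to 0.3, evaluated on repeated iid samples.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # hypothetical population error rate of some fixed predictor

for n in [100, 1_000, 10_000]:
    # Draw 500 independent evaluation samples of size n; each sample mean is
    # one estimate of the population error rate p.
    estimates = rng.binomial(n, p, size=500) / n
    print(f"n={n:6d}  mean estimate={estimates.mean():.4f}  "
          f"variance={estimates.var():.2e}  predicted p(1-p)/n={p * (1 - p) / n:.2e}")
```

Each tenfold increase in n cuts the variance of the estimate by a factor of ten, which is all the iid model is promising here.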
Let’s apply the model to machine learning. Suppose I have a potential prediction function, and I evaluate its error on an iid sample of size n. I can say that the error out of sample should be no worse than the average in-sample error plus some correction that goes to zero at a rate of n^(-1/2). The theory of generalization is just this with a lot more math grinding to make us feel smarter than we are.
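For the record, the precise statement behind that sentence is nothing deeper than Hoeffding’s inequality for a single fixed prediction function with zero-one loss (the notation below is mine, not from the lecture notes):

```latex
% With probability at least 1 - \delta over the draw of an iid sample of
% size n,
\[
  R(f) \;\le\; \widehat{R}_n(f) + \sqrt{\frac{\log(1/\delta)}{2n}},
\]
% where R(f) is the population (out-of-sample) error of the fixed function f
% and \widehat{R}_n(f) is its average in-sample error.
```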
This isn’t to say the iid model is all bad. There are still places where you are intentionally generating the randomness to collect data. In such cases, deductions from iid modeling provide actionable guidance. If you want some reasonable rules for quality control or clinical trials, iid sampling gives you a way to reason about your rulemaking.
Unfortunately for machine learners, we get the data we get and use the same methods no matter what. You’re going to train that transformer whether or not your data is an iid sample. And you can never test whether your data was generated in the iid model, nor can you test if it will be generated in the iid model tomorrow.
The most gracious way I can describe iid analysis is that it provides a means to compare different methods. One of the first generalization theorems was proved by Tom Cover and Peter Hart. It stated that in the iid model, the nearest neighbors algorithm has an error no worse than twice that of the optimal population prediction function in the limit of infinite data. In modern terms, the nearest neighbors algorithm should achieve nearly zero errors on the venerable ImageNet classification problem if it has access to a sufficiently large set of images from the internet (twice a near-zero optimal error is still near zero). The problem with Cover and Hart’s theorem is that it doesn’t estimate how many images you need to get near zero error. Generalization theory aims to determine the amount of data required for a model to be sufficiently accurate for deployment. It is supposed to give us a sense of how big our data sets should be. And it’s supposed to tell us that some machine learning models are less sample hungry than others.
But it doesn’t. I don’t believe in the tea-leaf reading that scaling law proponents do, but the most salient lesson we should have learned from the past two decades of machine learning practice is not to underestimate the impact of collecting more data. We never know how big our data sets should be, and every time we arrogantly push scaling limits, we find new interesting phenomena. The only rule we’ve learned is to make the data set bigger.
There’s thus no point obsessing too much about generalization bounds. Proving Cover and Hart’s theorem isn’t too hard, though it does rely on some continuity properties of the distribution. But in class, I’ll just prove a simple bound based on Hoeffding’s inequality and the union bound that shows that even if you evaluate many functions and pick the one with the best in-sample error, you’ll still have good out-of-sample predictions in the iid model.
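As a preview, here is a toy simulation of that selection step. It is my own illustration with the textbook constants, and for simplicity each candidate function is scored on its own iid evaluation sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, delta = 2_000, 500, 0.05   # sample size, number of candidates, failure probability

# Each candidate function's true (population) error rate, chosen arbitrarily.
true_errors = rng.uniform(0.2, 0.5, size=K)
# In-sample error of each candidate on its own iid evaluation sample of size n.
in_sample = rng.binomial(n, true_errors) / n

best = np.argmin(in_sample)                       # pick the best-looking candidate
bound = np.sqrt(np.log(2 * K / delta) / (2 * n))  # Hoeffding + union bound correction

print(f"picked candidate {best}: in-sample error {in_sample[best]:.3f}, "
      f"population error {true_errors[best]:.3f}")
print(f"correction term: {bound:.3f}, actual gap: {true_errors[best] - in_sample[best]:.3f}")
```

Even after cherry-picking the best of 500 candidates, the gap between in-sample and population error stays below the sqrt(log(2K/δ)/(2n)) correction with probability at least 1 − δ.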
That will be all I have to say and want to say about generalization theory. A decade ago, I coauthored a paper asking us to rethink our theories of generalization. Today, I’m fine putting the theory back on the shelf. All I’ve learned is that no matter how much math we use, generalization theory is not predictive of what happens in practice. Despite 75 years of trying to come up with a metapredictive framework to estimate the amount of data needed to make predictions about the future, the answer is always “more.”
Making matters worse, most of the metapredictions of iid theory are refuted experimentally when we can’t control the sampling process. It’s very easy to fool yourself into thinking that stuff like overfitting and p-hacking is a problem if you cook up iid models in R or Python and run simulations. But if you don’t see the same things happening on real data, it means your simulations are not an accurate reflection of reality. It means your model isn’t useful! We’ll talk about this more next week.
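In the meantime, to be concrete about the sort of simulation I mean, here is a pure-noise p-hacking sketch (everything is synthetic by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n_test, n_tries = 200, 10_000   # size of the "test set", number of models tried

y_test = rng.integers(0, 2, size=n_test)               # labels are pure coin flips
guesses = rng.integers(0, 2, size=(n_tries, n_test))   # each "model" guesses at random
accuracies = (guesses == y_test).mean(axis=1)

# Typically prints something well above 0.5, even though no rule can beat
# 0.5 on genuinely new data.
print(f"best 'test accuracy' after {n_tries} tries: {accuracies.max():.3f}")
```

In the simulation, this looks like an alarming failure mode; the claim above is that real benchmark data doesn’t reliably behave this way.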
1. If you care, iid is an abbreviation of independent and identically distributed.
Hi, very interesting blog.
1. There is this term "memorization" that seems to be thrown around a lot. What is the difference between memorization and generalization?
2. Is there any theory of what happens when the i.i.d. assumption is broken? For example, something that mathematically quantifies the "brokenness" of the i.i.d. assumption and then provides a guarantee on the sample complexity?
3. How does over-parameterization and over-fitting relate to generalization theory?
4. Are there any instances (even "hypothetical") where "more data" is not good? Perhaps if you've built your model under the i.i.d. assumption and this assumption does not hold in practice, so the model just "collapses"?
5. Also, how does feature representation affect the generalization bounds?
What are your thoughts on modern theory like Arora and Goyal's "A Theory for Emergence of Complex Skills in Language Models" (https://arxiv.org/abs/2307.15936)? Should we see more work like this? Less?
Generalization research in AI seems to be in the "extraordinary science" phase, to use the words of Kuhn. Theoretical progress from here will require recourse to philosophy and first principles, as well as cross-disciplinary interactions with psychology and cognitive science.