After class last Thursday, Ishaq Aden-Ali brought up a sneakily obvious point about how theorists approach learning theory: it starts from the position that test sets don’t exist.
This is not to say that learning theory can’t analyze methods that use test sets. Some theory papers talk about using test sets for generalization (notably this one and this one). It’s not hard to prove theorems about the utility of test sets. But when you dig into the history of the subject or the most famous textbooks, you see that the test set never gets mentioned.
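To give a flavor of how easy such theorems are (this is the standard Hoeffding argument, my gloss rather than a claim about either linked paper): if a predictor $f$ is chosen without ever looking at a held-out set of $m$ samples, its test error concentrates around its true error,

$$
\Pr\left[\,\big|\hat{R}_{\mathrm{test}}(f) - R(f)\big| \geq \sqrt{\frac{\log(2/\delta)}{2m}}\,\right] \leq \delta,
$$

where $\hat{R}_{\mathrm{test}}(f)$ is the fraction of held-out samples that $f$ misclassifies and $R(f)$ is its population error. A test set of a few thousand points pins down the error of any fixed predictor to within a few percentage points. Yet the classical results never lean on estimates like this.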
For example, the most famous result from the 1960s is Novikoff’s Mistake Bound for the Perceptron. This online learning result proves that if your data is separable by a hyperplane with some margin, the Perceptron algorithm will make a finite number of mistakes no matter how many data points you have. There is no test set in Novikoff’s bound. You don’t need one! In the model where your data is separable by a hyperplane, if you run Rosenblatt’s algorithm for long enough, you generalize perfectly.
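For the record, here is the statement (from memory, with the standard normalization): if every example satisfies $\|x_i\| \leq R$ and some unit-norm hyperplane separates the data with margin $\gamma > 0$, then the Perceptron makes at most

$$
\left(\frac{R}{\gamma}\right)^2
$$

mistakes, no matter how long the stream of examples runs. The guarantee concerns the training stream itself; held-out data appears nowhere in the statement.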
Less well known, though frequently mentioned on this blog, is that Bill Highleyman arguably invented empirical risk minimization in his PhD thesis at Bell Labs. Bill’s first paper on ERM also makes no mention of test sets. It contains a theorem asserting that, on a finite data set, the empirical risk is an unbiased estimate of the average prediction error. This is only true for a single, fixed function, not uniformly over multiple functions. Highleyman’s result caught the eye of Russian mathematicians, including Alexey Chervonenkis (the C in VC Dimension). In 2004, Chervonenkis recalled an argument about Highleyman’s paper with Yakov Khurgin in 1965:
“The justification of convergence was rather ‘wild’… We gave explicit examples where getting an acceptable result required an arbitrarily long training sequence (even for linear rules). It is here that uniform convergence was mentioned for the first time. Khurgin was saying, ‘You are playing on the non-compactness of the Hilbert ball,’ or ‘You are demanding uniform convergence.’ I was replying, ‘Yes, we are!’”
The whole recollection is a fun read. Chervonenkis describes arguments about overparameterization, the first leave-one-out proofs, and a colorful metaphor comparing optimization to visiting a “venereal clinic” (h.t. Maxim Raginsky). But he never mentions testing sets.
In fact, the generalization bounds pioneered by Chervonenkis and Vapnik dispense with the test set entirely. They argue that your generalization error will be small if you take the model that minimizes the training error. With appropriate simplifying assumptions about the behavior of the world, these theorems assert that you don’t need a test set: the output of your optimization algorithm generalizes. These are interesting theorems! They certainly give us some actionable insights, telling us facts about simple linear rules with bounded coefficients. However, it is hard to argue that these theorems say much about actual practice.
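To be concrete about what these theorems say (my paraphrase, with constants suppressed): for a single fixed function $f$, the training error is an unbiased estimate of the risk, but to certify the empirical risk minimizer you need a bound that holds uniformly over the whole function class $\mathcal{F}$,

$$
\mathbb{E}\big[\hat{R}_n(f)\big] = R(f) \quad \text{for fixed } f, \qquad
\sup_{f \in \mathcal{F}} \big|R(f) - \hat{R}_n(f)\big| \lesssim \sqrt{\frac{d \log n + \log(1/\delta)}{n}} \quad \text{with probability } 1-\delta,
$$

where $d$ is the VC dimension of $\mathcal{F}$. The uniform bound is what lets you skip the test set: whichever function the optimizer returns, its training error is already close to its risk.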
Somehow, everyone missed Highleyman’s other 1962 paper, which tried to build a theory of test sets. Here’s the key excerpt from the abstract:
“[T]he problem of the optimum partitioning of a sample of fixed size between the design and test phases of a pattern recognition machine is discussed. One important nonparametric result is that the proportion of the total sample used for testing the machine should never be less than that proportion used for designing the machine, and in some cases should be a good deal more.”
Highleyman argues that everyone uses holdout sets for evaluation, but we neither know why they work nor have rigorous guidance for how to build them. Frankly, he comes to confusing conclusions. Why does the test set need to be larger than the training set? His reasoning is convoluted. Not only is the paper hard to follow, but I’m pretty sure his mathematical derivations are incorrect. However, what is clear is that the method he is attempting to analyze is precisely the holdout method we use today, not a more sophisticated statistical method like cross validation. Indeed, Stone’s famous paper on cross validation comes a decade after Highleyman and a year after Duda and Hart’s Pattern Recognition text, which proposed the holdout method for model selection.
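For concreteness, here is a minimal sketch of the holdout protocol Highleyman was analyzing. Everything specific here is mine, not his: the function names, the NumPy implementation, the nearest-centroid learning rule, and the 50/50 default split (chosen only to echo his recommendation that the test portion be at least as large as the design portion).

```python
import numpy as np

def holdout_error(X, y, fit, test_fraction=0.5, seed=0):
    """Estimate the error of a learned classifier with a single holdout split.

    fit: a function taking (X_train, y_train) and returning a predictor that
         maps an array of inputs to predicted labels.
    test_fraction: fraction of the sample reserved for the "test" phase.
         The 0.5 default mirrors Highleyman's advice that the test portion
         should be at least as large as the design portion (an illustrative
         choice, not his derivation).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    n_test = int(np.ceil(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    predictor = fit(X[train_idx], y[train_idx])                   # "design" phase
    test_error = np.mean(predictor(X[test_idx]) != y[test_idx])   # "test" phase
    return test_error

# A placeholder learning rule: nearest-centroid classification.
def nearest_centroid(X_train, y_train):
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    def predict(X):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[np.argmin(dists, axis=1)]
    return predict

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
    y = np.repeat([0, 1], 200)
    print(holdout_error(X, y, nearest_centroid))
```

The point of the sketch is only that the protocol is trivial to state. What Highleyman wanted, and what classical theory never supplied, is principled guidance on how to pick test_fraction.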
Why did machine learning theory never engage with the holdout method? Partially, it’s because machine learning practice didn’t fully embrace the holdout method until the 1990s. This can’t be the whole story, though. The holdout method has been around for as long as empirical risk minimization and mistake bounds.
Whatever the reason for the omission, we can’t claim a “theory” of machine learning that doesn’t rest on a theory of the holdout method. The last forty years have proven the test set to be as important as the training set. The holdout method has remained a methodological standard, even as other algorithmic fads for fiddling with training data have come and gone. The evaluation is more important than the learning algorithm.
Can I close with a call to theoretical action? There are still a lot of questions about how these holdout sets shape our choice of prediction functions. For example, I still don’t have a satisfactory answer for why we robustly observe these “generalization scatter plots” we first saw when reproducing CIFAR10.
Are these plots explainable by theory? Practitioners might just say it doesn’t matter, but I’d bet that if we better understood the causes of this trend, we’d be better equipped to make sense of the current benchmark wars that dominate our LLM discourse.
I was an early partisan of the fundamental importance of the test set! In this 2005 NIPS paper with Lana on estimating the intrinsic dimension (our first paper together!), we had to use test-set estimates in order to correct for the negative bias of ERM, which led to deflated estimates in earlier work: https://proceedings.neurips.cc/paper_files/paper/2005/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html.
Also, estimating the intrinsic dimension! Does anybody care about manifold learning anymore? Back then it was a thing (if only a minor thing), but here's also a bittersweet tidbit: Partha Niyogi was chairing the session in which I gave the talk and Sam Roweis asked the first question.
Isn't one factor in the explanation that "classic" learning theory was heavily biased toward analyzing "worst-case" scenarios? A randomized hold-out set is somewhat at odds with this approach (as it implicitly raises the question of what happens if you made a misrepresentative, "bad" train-test split).
On the other hand, it will probably also be hard to find explicit discussion of the hold-out evaluation practice in the works that tried to develop alternative theories of generalization to overcome that limitation [I'm thinking of the Seung, Sompolinsky, Tishby "Statistical mechanics of learning from examples" papers and related ideas].