Benchmarking our benchmarks
The only validated theories of generalization are sociological and historical.
This is a live blog of Lecture 13 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
As a discipline, machine learning gets a bad rep when it comes to evaluation.1 Indeed, nothing drives the internal excitement and progress in machine learning more than the community’s embrace of competitive testing. The AI Winters occur when there are no good benchmarks to fight over. “Progress” happens when nerds have numbers to make go up, and entrepreneurs have benchmarks to sell as signifiers of sure bets.
The main competitive testing framework in machine learning goes through dataset benchmarking. Researchers compile datasets partitioned in two. There is a training set with which you can do whatever you want, and there is a testing set upon which you will evaluate. The shared community goal is to find a model that achieves the lowest possible error on the testing set.
Now, if you are just going to release errors on the testing set, it seems like you’d need to hide the answers from the competitors. Otherwise, people would just return the correct labels in the testing set. One way to get around this is to split the testing set in two, hiding half as an evaluation for the end. This is the motivation behind machine learning competitions.
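To make the protocol concrete, here is a minimal sketch in scikit-learn, with an illustrative dataset and model standing in for a real benchmark. It shows the basic train/test split, plus the competition-style twist of carving the test set into a public half for leaderboard feedback and a private half scored only at the end.

```python
# A minimal sketch of benchmark-style evaluation, with the test set split
# again into "public" and "private" halves, competition style.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# The benchmark split: do whatever you want with the training set,
# report your error on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The competition variant: score submissions on the public half during the
# contest, reveal the private half only for the final ranking.
X_public, X_private, y_public, y_private = train_test_split(
    X_test, y_test, test_size=0.5, random_state=1
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("public leaderboard error:", 1 - model.score(X_public, y_public))
print("private (final) error:   ", 1 - model.score(X_private, y_private))
```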
But there’s another solution, which is community trust. If we are open and transparent about our methods, then these benchmarks can help us compare the merits of different ideas and settle on locally applicable heuristics for prediction. If you release code, no one will think you cheated once they run it. Though this open dialogue seems to give away the valuable secret sauce, people still build billion-dollar companies on open source software.
This paradigm of shared methods, datasets, and competition has been around in machine learning since at least 1960, barely 4 years after the Perceptron. That systems should be evaluated by averaging predictions on representative data sets has been around even longer. You can get a sense of the idea in the 1940s writings of Wiener and Shannon, who argued that prediction systems had to be evaluated on reasonable averages of future scenarios. Such evaluation was similarly promoted in psychology and social science, with Meehl’s Clinical vs. Statistical Prediction providing a notable record of the debate.
But train-test evaluation, where a data set becomes a fixed research artifact serving as both simulator and measurement, is something unique to machine learning. The original motivation for the train-test split was just engineering intuition. It was sort of obvious that if you tried to estimate the error rate on new data from the training data, you’d get misleading error estimates. Indeed, Duda and Hart note in the 1973 edition of their text Pattern Classification and Scene Analysis:
“In the early work on pattern recognition, when experiments were often done with a very small number of samples, the same data were often used for designing and testing the classifier. This mistake is frequently referred to as ‘testing on the training data.’”
But how should the data be partitioned to give a reasonable estimate of prediction quality? Duda and Hart hedge:
“The question of how best to partition a set of samples into a design set and a test set has received some analysis, and considerable discussion, but no definitive answer… When the number of samples is very large it is probably sufficient to partition the data into a single design set and a single test set. Although there is no theory to guide the designer in intermediate situations, it is at least pleasant to have a large number of reasonable options.” (emphasis added)
Since there were too many options, people just stuck with what was easy. Gather some data. Split the data in two. Release the data in public. Then it was a matter of trust within the community that what people published was honest benchmarking. This internal trust was key, and is key to all benchmarks. If everyone agrees, local expertise can be honed and perhaps generalized. More data sets can be made. More competitions can be run. Little benchmarks can drive local progress in expert communities. Since competition is done out in the open, anyone can try new proposals on their problems. We let hundreds of flowers bloom, and it’s undeniably worked well for the field. This research process is what Dave Donoho calls Frictionless Reproducibility.
Why is Frictionless Reproducibility successful? I proposed an explanation in a commentary on Donoho’s piece. We can view machine learning practice as a massively parallel genetic algorithm that fixates on goals where hill climbing is possible and ignores those where it isn’t.
But prescriptive mathematical explanations of Frictionless Reproducibility all fall short. If you are a grumpy statistician, you’ll always say stuff like “The error bars on the test sets are too small to matter. You are double-dipping. There are too many experimenter degrees of freedom. Test set benchmarking is not a severe test of a hypothesis.”
As we’ll see in the next class, our intuitions from statistics are just wrong when applied to the train-test paradigm. Even though in the last class I derived a motivation for train-test splits using Hoeffding’s Inequality, that bound is remarkably conservative. It is subject to an adaptivity critique, which suggests we can “overfit” without even looking at the training data.
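For reference, here is the standard form of that bound, as a sketch rather than the exact statement from last lecture. For a classifier $f$ chosen without looking at the test set, and a test set of $n$ i.i.d. samples scored with the 0–1 loss, Hoeffding’s Inequality bounds the gap between the empirical test error $\widehat{R}_{\mathrm{test}}(f)$ and the population error $R(f)$:

$$\Pr\Bigl[\,\bigl|\widehat{R}_{\mathrm{test}}(f) - R(f)\bigr| \ge \varepsilon\,\Bigr] \le 2\exp\bigl(-2n\varepsilon^{2}\bigr).$$

A union bound over $k$ classifiers chosen in advance only inflates the right-hand side to $2k\exp(-2n\varepsilon^{2})$, so the number of pre-registered submissions can grow exponentially in $n\varepsilon^{2}$ before the guarantee breaks down. The adaptivity critique targets the fine print: once a classifier is chosen after seeing test results for earlier ones, this argument no longer applies as stated.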
Machine learning researchers have fretted from the beginning about how competitive testing violates statistical intuition. Here’s Duda and Hart again:
“A related but less obvious problem arises when a classifier undergoes a long series of refinements guided by the results of repeated testing on the same test data. This form of ‘training on the testing data’ often escapes attention until new samples are obtained.”
As we’ll see in class on Thursday, there is little evidence that “training on the testing data” is actually a problem. No statistical learning theory can logically reconstruct the seventy years of applied machine learning practice. And if that applied practice now props up the US economy with computer tooling that undoubtedly does miraculous things, then maybe you’re the one who’s out of touch.
So instead of trying to explain the train-test benchmarking paradigm with math, today we’ll just study the history and sociology of the practice. We’ll go through various famous datasets, what went into making them, why they caught on as good test cases, and how we might apply these insights to problems we care about.
It’s helpful to appreciate the power of competitive testing because the last few years have seen an inward closure of the field. “Frontier labs” release neither code nor data. Some argue that we now need an oligarchical cold war to drive innovation in our discipline. Things seem less grim to me. Inside the companies, I’m told, the competitive testing paradigm still reigns supreme. And recent releases in language modeling suggest the public sector, embracing frictionless reproducibility, might be catching up.
1. Apologies to my colleague/co-instructor Deb Raji, who strongly disagrees with me.
The funny thing is that benchmarks play almost no part in the actual industrial practice of machine learning. Basic MLOps best practices dictate that your model should constantly be training and testing on new data, not a static benchmark. February's data is the training set for the March test set. And March's data becomes the new training set for the April test set. And so your machine learning system trundles forward into the future, always renewing itself, without a static benchmark in sight.
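Here’s a minimal sketch of that rolling cycle. The synthetic load_month helper is a hypothetical stand-in for pulling a month of labeled data out of your pipeline; the model is likewise illustrative.

```python
# A sketch of the rolling train/test cycle: each month's model is trained on
# the previous month's data and evaluated on the current month's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_month(seed, n=500, d=10):
    """Hypothetical stand-in for fetching one month of labeled data."""
    rng = np.random.default_rng(seed)
    w = np.ones(d)                              # shared signal across months
    X = rng.normal(size=(n, d))
    y = (X @ w + rng.normal(size=n) > 0).astype(int)
    return X, y

months = ["2025-02", "2025-03", "2025-04", "2025-05"]
for i in range(len(months) - 1):
    X_train, y_train = load_month(seed=i)       # e.g. February
    X_test, y_test = load_month(seed=i + 1)     # e.g. March
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    err = 1 - model.score(X_test, y_test)
    print(f"trained on {months[i]}, tested on {months[i + 1]}: error {err:.3f}")
```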
Or almost. You might keep a static benchmark for unit or integration tests, just to check that the code runs and produces sane results. And that's really where I think benchmarks should be used, by analogy with real-life bench marks etched into work surfaces. They are tools for internal calibration, to check that your code is working and your model is improving. Internal reuse is fine since you can't fool yourself for long. But they shouldn't be tools for external comparison between different models, because it's sometimes too easy and too profitable to fool others, especially if your models are closed.
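In code, that internal use looks less like a leaderboard and more like a regression test. A sketch, with a synthetic fixture and an arbitrary threshold standing in for whatever sanity check fits your system:

```python
# A benchmark as a unit test: a small, fixed, trivially learnable fixture used
# only to check that the training code runs and clears a sanity threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_model_clears_sanity_threshold():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)               # label depends on one feature
    model = LogisticRegression().fit(X[:100], y[:100])
    acc = model.score(X[100:], y[100:])
    assert acc > 0.9, f"sanity check failed: accuracy {acc:.2f}"
```

Run it with pytest alongside the rest of your test suite; the point is calibration, not comparison.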
Yes, external comparison via benchmarks can work if everyone is completely honest with themselves and each other. There was a golden age of open code and open models when that was true, and that was a credit to the field. But that age is closing, and people are paying less and less attention to benchmarks because good performance on them means so little now. Every machine learning paper boasts SOTA results on all the latest benchmarks, but nearly none of those results have real-world value. Every new LLM busts every popular benchmark, yet the models don't actually seem to be getting more useful, just more impressive. Something has gone wrong over the last few years.