You got a 9 to 5, so I'll take the night shift
The robustness of the holdout method can't save us from populations changing.
I’ve been harping on how many foundational elements of machine learning and data science were deemed obvious and went unrewarded. The most central of these is the train-test split. But throughout the history of machine learning, we have been deeply confused about why this method works. Even at its invention, Highleyman initially misestimated how much testing data would be necessary. Folklore told us that static holdout sets would grow stale and that evaluating test error too many times was taboo.
But is this even true? At some point, we need to empirically validate the folk stories that we spin. Let me recall my four-step plan to machine learning riches.
1. Collect as large a data set as you can.
2. Split this data set into a training set and a test set.
3. Do whatever you want on the training set to build models.
4. Of all of these models, choose the one that minimizes the error on the test set.
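Here is a minimal sketch of that recipe in code, assuming scikit-learn; the particular dataset, models, and split size are illustrative stand-ins, not a recommendation.

```python
# Minimal sketch of the four-step recipe (illustrative choices throughout).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Step 1: collect as large a data set as you can (a toy stand-in here).
X, y = load_digits(return_X_y=True)

# Step 2: split into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: do whatever you want on the training set to build models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Step 4: pick the model with the lowest error (highest accuracy) on the test set.
scores = {name: model.score(X_test, y_test) for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> picking", best)
```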
We can justify this approach with statistics. The arguments that claim you overfit if you evaluate on the test set too often are backed by large deviation inequalities proving you can evaluate an exponentially large number of models on the test set. But such evaluations require you not to adapt to past results. If you find something that works on the test set and tweak it a bit to eke out a little more performance, you are adapting to the test set. The statistical bounds for the adaptive case are far more conservative, suggesting you can only evaluate a handful of models before you have ruined your test set. Given that researchers at tech companies run absurdly expensive autotuning experiments on these benchmarks, the preposterously low test set errors they report must be artifacts of overfitting, right?
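To put numbers on that folklore, here is the usual back-of-the-envelope version of the argument, sketched under standard assumptions (a loss bounded in [0,1], n test points, K candidate models); the exact constants and the adaptive bookkeeping differ across treatments.

```latex
% Non-adaptive case: Hoeffding's inequality plus a union bound over K models
% chosen before ever looking at the test set.
\[
\Pr\left[ \max_{k \le K} \left| \widehat{R}_{\mathrm{test}}(f_k) - R(f_k) \right| > \epsilon \right]
  \le 2 K \exp\!\left( -2 n \epsilon^2 \right)
\]
% So n on the order of \log(K)/\epsilon^2 test points suffice: K can be
% exponential in n.
%
% Adaptive case (conservative worst-case bookkeeping): if each of K rounds of
% tweaking depends on earlier test-set feedback, the union bound has to run
% over every possible adaptive sequence, whose count can grow exponentially
% in K. The same inequality then demands n on the order of K/\epsilon^2,
% i.e., only about n adaptive evaluations before the guarantee is vacuous.
```

The gap between the two regimes is the folklore in a nutshell: reuse is harmless if you never adapt, and supposedly ruinous after only a handful of adaptive tweaks.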
I’ll confess, I was rather convinced they were overfitting in 2016. But I was very wrong. My perspective completely changed after running an experiment with Vaishaal Shankar, Ludwig Schmidt, and Rebecca Roelofs.
Ludwig wanted to create new test sets for machine learning problems. If test-set leakage, adaptivity, or overfitting (whatever these mean) had happened, we’d likely be able to find evidence by reproducing the test sets for the original problems. If we could create “i.i.d.” recreations of the test sets, we’d surely see huge performance drops from these massive models whose builders had been iterating against the test data for years.
Even though the dataset itself is a bit ridiculous (identifying inscrutable 32x32-pixel thumbnails), CIFAR10 seemed like an easy first target. Alex Krizhevsky had rigorously documented the instructions for generating the data in his master’s thesis.
CIFAR10 was a subsample of the Tiny Images Dataset. Becca, Ludwig, and Vaishaal resampled images from the data set using Alex’s methods and carefully removed duplicates. Now we just had to find out how well all of these fancy models performed on our new test set.
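As a purely illustrative aside, and not necessarily the procedure Becca, Ludwig, and Vaishaal actually used, a near-duplicate screen can be as simple as comparing normalized pixel vectors and flagging anything suspiciously close:

```python
# Illustrative near-duplicate screen: flag candidate images whose normalized
# pixel vectors are very close to any image already in a reference set.
# (Not necessarily the dedup procedure used for CIFAR10.1.)
import numpy as np

def _normalize(batch):
    # batch: (N, H, W, C) uint8 array -> (N, D) mean-centered unit vectors
    flat = batch.reshape(len(batch), -1).astype(np.float64)
    flat -= flat.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.maximum(norms, 1e-12)

def near_duplicates(candidates, reference, threshold=0.95):
    """Return indices of candidates whose maximum similarity to any reference
    image exceeds `threshold` (the threshold here is an arbitrary choice)."""
    c = _normalize(candidates)
    r = _normalize(reference)
    sims = c @ r.T                      # (num_candidates, num_reference)
    return np.where(sims.max(axis=1) > threshold)[0]
```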
Here, another aspect of Frictionless Reproducibility came into play. If everyone puts their code online, it makes studying trends easier. You can use GitHub as a scientific resource, pulling codebases to cleanly compare methods. You can pull a few hundred models and test them together in the same controlled environment. This is not only metascience, but another approach to optimization under Frictionless Reproducibility.
We pulled dozens of models fit to CIFAR10 and compared their performances on the standard test set and on our new test set, CIFAR10.1. I was frankly flabbergasted by the results.
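In spirit, the comparison looks something like the sketch below; `load_pretrained_models`, `load_original_test`, and `load_new_test` are hypothetical placeholders for whatever loading code you have, not the actual pipeline from the study.

```python
# Sketch of the comparison: evaluate each pretrained model on the original
# test set and on the replicated test set, then fit a linear trend.
import numpy as np

def accuracy(model, X, y):
    return float(np.mean(model.predict(X) == y))

def compare_test_sets(load_pretrained_models, load_original_test, load_new_test):
    X_orig, y_orig = load_original_test()
    X_new, y_new = load_new_test()

    orig_acc, new_acc = [], []
    for model in load_pretrained_models():
        orig_acc.append(accuracy(model, X_orig, y_orig))
        new_acc.append(accuracy(model, X_new, y_new))

    # Linear fit of new-test accuracy against original-test accuracy
    # (the red line in the plot described next).
    slope, intercept = np.polyfit(orig_acc, new_acc, deg=1)
    return np.array(orig_acc), np.array(new_acc), slope, intercept
```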
In this graph, the x-axis is the accuracy on the original CIFAR10 test set. On the y-axis is the accuracy evaluated on “CIFAR10.1.” Each blue dot represents a single machine learning model, trained on the original CIFAR10 data, that we pulled from GitHub. The red line is a linear fit to these models, and the dashed line is what we would see if the accuracy were the same on both test sets. What do we see? The models that performed best on the original test set performed best on the new test set. That is, there is no evidence of “overfitting.”
This phenomenon reproduced itself in every setting we looked at. In a much larger effort, we reproduced the Imagenet test set and saw the same plot. We saw this phenomenon on video and detection. We saw this phenomenon on NLP benchmarks. We saw this phenomenon in Kaggle competitions. Chhavi Yadav and Leon Bottou even demonstrated this phenomenon on MNIST. We didn’t overfit to the test set there either.
But we did see a performance drop. Though the ordering of models was preserved, accuracy was always lower on the replicated test set. Some of these drops were equal to years of “progress” on the benchmarks. And this was on test sets where we bent over backward to replicate the data collection process. On CIFAR10 and Imagenet, it’s possible that just having a different set of annotators was enough to induce a large drop in performance. Imagine what happens when you move your machine learning model into some genuinely novel scenario. There’s no way to predict how well you will do.
The takeaway from this series of studies is simple. Train-test benchmarking has absurdly robust internal validity. It’s incredibly hard to adaptively overfit to a test set, and our “generalization bounds” are wildly conservative for machine learning practice. External validity is another matter: how machine learning models will perform on new data is not predictable from benchmarking.
If minor differences in reproduction studies lead to major drops in predictive performance, can you imagine what happens when we take a machine learning model trained on a static benchmark and deploy it in an important application? We’ve seen AI models for radiology fail once someone changes the X-ray machine, and models for sepsis prediction fail once we change the hospital involved. These sorts of shifts in context and population are the major challenge for predictive engineering. I’m not sure what anyone can hope to do except constantly update the models so they are as current as possible and hope for the best.
This means that we have to think about what we are going to use these machine learning models for. There are several possibilities. There’s the generative model approach, where we train on internet-scale data and perform mystical information retrieval to wow funders and generate bullshit. It’s an option! It’s not one I’m particularly interested in, but there is no shortage of blue-checked influencers on Elon’s webpage hawking how this will change everything.
But a lot of us want to use machine learning to build predictive systems for outcomes that impact and help people. If machine learning is extremely sensitive to changes in context, we have to account for this in our deployments. We have to think about what we need the models to do and in which contexts they will work best. We have to think about how populations change and control for this uncertainty in some fashion. In the second half of the course, then, we turn away from predictions and toward actions. What are we actually doing with these predictive models? How can we use them to guide decision making? What are their pitfalls? Next week, we dive in.
On the monotonicity: could it be explained by importance sampling? Test error is an aggregate metric, and maybe we are looking at an importance-sampling weighted average that accounts for the distribution difference.
I'm a little surprised, but not overly, at the monotonicity. It makes sense that the only way to do well on every test set is to learn the precise concept. And by that, I mean not only the task, but the selection criteria and so on. What's harder for me to fathom is how narrowly *linear* that is. That means that for every X errors on the real test set there will be mX errors on some new set. Why would it be so precise?
Gael Varoquaux once explained it to me this way: suppose your dataset is undersampled, and say it only spans some affine subspace, or some sub-manifold. Do you conclude that your model only describes that affine subspace or that manifold? No. Why? Because if there were a meaningful concept that applies only to that subspace, and that differs from the ambient concept, then in order to discover it you would have had to sample from a measure-zero subset. That doesn't happen. The only thing that has any chance of happening is missing small modes.