I appreciate all of the feedback I received about yesterday’s polemic. People definitely have a lot of opinions about overfitting! I’ll spend the next week going through the many rebuttals raised. The first such rebuttal was succinctly put by Ehud Karavani on Bluesky:
“Overfitting assumes iid train/test.”
This is not a uniformly held belief but one that remains quite common. Several others brought it up in email and the comments.
I’m not willing to grant that this is what overfitting assumes, but I’m happy to accept that there are multiple kinds of overfitting.
Statistical overfitting. This occurs because a test set gives a noisy, though unbiased, estimate of prediction error.
Adaptive overfitting. This occurs because we look at the test set too much and then perform poorly on a new iid sample.
Contextual overfitting. This occurs because the new data differs in some way from the data we use to train our model.
Contextual overfitting is also known by other names, such as distribution shift or domain shift. Those aliases blame nature, not the analyst, for poor performance on new data. “Who could have expected the black swan? There was no black swan in the original data distribution.”
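To make these three notions slightly more concrete, here is one rough formalization. The notation is mine, not anything standard:

```latex
% Statistical: the holdout estimate is unbiased but noisy.
\hat{R}_{\mathrm{test}}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr),
\qquad
\mathbb{E}\bigl[\hat{R}_{\mathrm{test}}(f)\bigr] \;=\; R_{\mathcal{D}}(f),
\qquad
\operatorname{sd}\bigl(\hat{R}_{\mathrm{test}}(f)\bigr) \;=\; O\!\bigl(1/\sqrt{n}\bigr).
```

In this notation, adaptive overfitting is a downward bias in the estimate because the model was chosen by peeking at the test set, and contextual overfitting is the new data coming from a different distribution than the one the test set was drawn from, so even an honest estimate of the old risk says little about the new one.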
If you want more definitions of overfitting, tell me in the comments. Which did I miss?
For the ones I list here, the case is closed on the first two:
The holdout method does not exhibit statistical or adaptive overfitting.
This is not an “always” statement. But a long list of work led by Becca Roelofs, Ludwig Schmidt, and Vaishaal Shankar has shown that it’s true. There isn’t overfitting on ImageNet, in the sense that the better models perform on the test set, the better they perform on new data sets. There isn’t overfitting in NLP question answering. There isn’t overfitting on Kaggle leaderboards. If train and test are iid (or nearly iid, as in MNIST), we do not witness overfitting.
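A toy simulation makes this unsurprising when train and test really are iid. This is just a sketch of the iid logic, not a stand-in for any of those replication studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 50 hypothetical "models" with true accuracies between 0.6 and 0.9.
true_acc = rng.uniform(0.6, 0.9, size=50)

n_test, n_new = 10_000, 10_000  # sizes of the original and fresh iid test sets

# Measured accuracy on the original test set and on a freshly drawn iid test set.
test_acc = rng.binomial(n_test, true_acc) / n_test
new_acc = rng.binomial(n_new, true_acc) / n_new

# Under iid sampling, the two measurements track each other: models that do
# better on the old test set do better on the new one.
print("correlation between old and new test accuracy:",
      np.corrcoef(test_acc, new_acc)[0, 1])
print("largest gap (old minus new):", np.max(test_acc - new_acc))
```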
The holdout method works better than statistics suggest it should. All of the theorems we prove to justify the holdout method (usually doing some sort of union bound over possible tests) are laughably conservative about what happens in practice. You can even write theory papers digging into this conservatism (this one or that one).
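For reference, the sort of guarantee I have in mind is a standard Hoeffding-plus-union-bound statement (not anything specific to those papers): for a bounded loss, evaluating K candidate models on a holdout set of size n gives, with probability at least 1 − δ,

```latex
\max_{k \le K} \bigl|\hat{R}_{\mathrm{test}}(f_k) - R(f_k)\bigr|
\;\le\; \sqrt{\frac{\ln(2K/\delta)}{2n}} .
```

The gaps observed in practice are far smaller than this bound allows, which is exactly the conservatism being complained about here.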
Let me be clear: it’s possible for ML engineers to initially “fit the training set too quickly,” so the test error goes up. This certainly happens, and then they have to deploy tricks to “fit the training set more slowly” or whatever. Add weight decay or dropout or batch norm or whatever is trendy today. Go for it. I don’t even know the best ones anymore because you all are writing tens of thousands of machine learning papers every year. I’m not arguing against the art and skill of machine learning engineering.
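For concreteness, a minimal PyTorch-style sketch of the kind of knobs I mean. The numbers are arbitrary placeholders, not recommendations:

```python
import torch
import torch.nn as nn

# A tiny classifier with the usual regularization knobs bolted on.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # batch norm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout
    nn.Linear(256, 10),
)

# Weight decay lives in the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```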
If you want to call “noob errors in getting good performance” overfitting, we can do that. But that is not how people use the word. Overfitting implies a false confidence in results: you get a new “iid” sample from somewhere and make a surprisingly bad prediction. If you use the holdout method, this doesn’t happen.
Contextual overfitting, on the other hand, is very real. And it’s often possible to blame the data scientists for its occurrence. There is an apocryphal story about an early pattern recognition challenge sponsored by the military. The goal was to detect whether tanks were in images. However, in the competition, the machine learning methods overfit to the cues in the background. Images with tanks had different weather conditions than images without tanks, and the algorithms attended to the weather, not the tanks. On new tank images, the machine learning methods performed poorly.
This is a cautionary tale, but there is no evidence it ever happened. However, I still think we can use the Soviet Tank Problem as a machine-learning Aesop fable. Is this a parable of overfitting? What word would we use to describe a machine learning algorithm latching on to the wrong “concepts” in the train/test corpus? The issue here seems to be that the data is not fully representative of how we will evaluate the algorithm in the field. You could say the “testing distribution” “shifts,” but that’s not a precise description of the problem. I’m using scare quotes because these terms are about as precise as “overfitting.” The problem is that you collected data that was insufficient to pin down the prediction problem for a machine learning system. Because pattern recognition is atheoretical, the only way we can articulate our evaluation expectations is to declare that the data is representative and sufficient for statistical pattern recognition. In other words, the Soviet Tank Problem is an evaluation problem.
The tank fable is useful because even if this particular story didn’t happen, you don’t have to look too hard to find datasets and benchmarks manifesting the same issue. In a dataset used to detect malignant skin tumors with ML, many images of malignant tumors had rulers in them, while the benign ones didn’t. In a dataset used to detect pneumonia, many of the images were of patients who had already been treated and had visible chest drain tubes. Researchers at Google showed that facial recognition latched onto silly cues in the hyperbolic publications of business school fabulist Michal Kosinski.
The tank story might be an urban legend, but people are still building datasets where spurious cues are the most salient signal for the prediction problem on the provided data. The Soviet Tank Effect is ubiquitous. Is it overfitting? You could certainly say that your data set overfit to the prediction task. Regardless of what you call it, the Soviet Tank Problem is a data curation problem. The machine learning worked as intended.
The Soviet Tank Problem is really hard to avoid in practice sometimes. I always go to an example from my lab where a team built a specialized object classifier on a mobile app. They made sure to train with all the objects on a variety of realistic background types: grass, sand, asphalt, etc. Then they had users test the app on the same background types, as well as new ones to see how it would generalize. Performance was good on the known backgrounds, except it dropped on grass. The reason? They trained it on grass in Maryland (green), but the users tested it on grass in southern California (brown).
I committed a version of the "Tank" error. We were developing a computer vision system to classify freshwater macro-invertebrates to genus. Images were collected by a robotic device that dropped each specimen into an alcohol-filled container and photographed it against a beautiful blue background. My colleagues collected 100 specimens from each of 54 genera. Each specimen was put in its own vial, and the vials were put in boxes, sorted by genus. We hired some undergrads to do the photography. Naturally, they came in on Day 1 and opened box 1 and photographed everything in it. On Day 2, they did boxes 2 and 3, and so on. It turned out that each day, different bubbles formed in the alcohol around the edges of the visual field. These basically bar-coded the class. We got suspicious when the classifier was too accurate. To figure out what was going on, we constructed what would now be regarded as very simple visual explanations. These revealed that the classifier was looking at the bubbles and not the specimens. Face plant! I had made the "tank in the trees" error!
Fortunately, it was easy to mask out the bubbles. But the lesson I learned was one we teach in intro statistics: Always randomize everything you can think of. If we had randomized the order in which the specimens were photographed, we would have broken the statistical link between the bubbles and the genus labels.
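A hypothetical sketch of what that randomization could have looked like when scheduling the photography; the counts mirror the 54 genera and 100 specimens described above:

```python
import random

# Build the photography schedule by shuffling all vials across genera,
# rather than photographing one genus-sorted box per day.
vials = [(genus, specimen_id)
         for genus in range(54)
         for specimen_id in range(100)]

random.seed(0)          # any seed; the point is the shuffle, not the seed
random.shuffle(vials)   # day-to-day artifacts (bubbles, lighting) no longer track genus

# Roughly 100 photos per day, now with genera mixed within each day.
days = [vials[i:i + 100] for i in range(0, len(vials), 100)]
```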