18 Comments

The Soviet Tank Problem is really hard to avoid in practice sometimes. I always go to an example from my lab where a team built a specialized object classifier on a mobile app. They made sure to train with all the objects on a variety of realistic background types: grass, sand, asphalt, etc. Then they had users test the app on the same background types, as well as new ones to see how it would generalize. Performance was good on the known backgrounds, except it dropped on grass. The reason? They trained it on grass in Maryland (green), but the users tested it on grass in southern California (brown).

100% this. It's really hard! My blogging tone here comes off as dismissive, but I'm definitely not arguing that avoiding the Soviet Tank Problem is easy or just a noob mistake. It's the Achilles' heel of machine learning.

Because machine learning is atheoretical, all we have is deeply theory-laden *data*. But our language to talk about the theories undergirding our data sets and benchmarks is woefully poor.

Finally, something I can agree with! :)

I committed a version of the "Tank" error. We were developing a computer vision system to classify freshwater macro-invertebrates to genus. Images were collected by a robotic device where the specimen was dropped into an alcohol-filled container and photographed against a beautiful blue background. My colleagues collected 100 specimens each from 54 genera. Each specimen was put in its own vial, and the vials were put in boxes, sorted by genus. We hired some undergrads to do the photography. Naturally, they came in on Day 1 and opened box 1 and photographed everything in it. On Day 2, they did boxes 2 and 3, and so on. It turned out that each day, different bubbles formed in the alcohol around the edges of the visual field. These basically bar-coded the class. We got suspicious when the classifier was too accurate. To figure out what was going on, we constructed what would now be regarded as very simple visual explanations. These revealed that the classifier was looking at the bubbles and not the specimens. Face plant! I had made the "tank in the trees" error!

Fortunately, it was easy to mask out the bubbles. But the lesson I learned was one we teach in intro statistics: Always randomize everything you can think of. If we had randomized the order in which the specimens had been photographed, we would have broken the statistical link.
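
A minimal sketch of that lesson, using the inventory numbers from the story (the 200-specimens-per-day batch size is an arbitrary choice of mine): shuffle the full (genus, specimen) list before any photography, so whatever varies from day to day, bubbles included, cannot line up with the label.

```python
import random

# Hypothetical inventory: 54 genera x 100 specimens, boxed by genus.
specimens = [(genus, specimen) for genus in range(54) for specimen in range(100)]

# Photographing box by box confounds the day of photography (and its bubbles)
# with genus. Shuffling the order breaks that statistical link up front.
random.seed(0)  # fixed seed only so the schedule is reproducible
random.shuffle(specimens)

# Photograph 200 specimens per day, in randomized order across all genera.
per_day = 200
schedule = [specimens[i:i + per_day] for i in range(0, len(specimens), per_day)]
```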

I don't consider this to be overfitting. As Ben says, the data set *is* the specification of the task, and we had collected a dataset for a confounded version of the task.

The topic of 'overfitting' appears wholly focused on how data sets are employed in training. In earlier incarnations of multi-layered models, the overfitting concerns included the hazard of excess model order, with too many layers and/or too many internal states. Do such concerns continue to get attention?

Yes, the last decade has shown that model order is both hard to characterize (the number of parameters doesn't necessarily indicate complexity) and not necessarily an impediment to good performance. There's a lot of interesting empirical research on this, but it's probably best demonstrated by the success of large generative models.
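
As one illustration of that empirical line of work, here is a toy sketch of my own (not from the post): a 1-D regression fit with random ReLU features and the minimum-norm solution. Near the interpolation threshold (as many features as samples) the fit is typically erratic, while the heavily overparameterized fit is usually much better behaved, which is one way parameter count fails as a proxy for complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(x, W, b):
    """Random ReLU features: phi_j(x) = max(0, w_j * x + b_j)."""
    return np.maximum(0.0, np.outer(x, W) + b)

# Toy task: 30 noisy samples of a smooth 1-D target.
n = 30
x_train = rng.uniform(0, 3, n)
y_train = np.sin(2 * x_train) + 0.1 * rng.standard_normal(n)
x_test = np.linspace(0, 3, 500)
y_test = np.sin(2 * x_test)

for p in (10, 30, 600):  # under-, critically, and heavily over-parameterized
    W, b = rng.standard_normal(p), rng.standard_normal(p)
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    theta, *_ = np.linalg.lstsq(relu_features(x_train, W, b), y_train, rcond=None)
    mse = np.mean((relu_features(x_test, W, b) @ theta - y_test) ** 2)
    print(f"p = {p:4d} random features: test MSE = {mse:.3f}")
```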

I still feel that I have not completely grasped why adaptive overfitting is not a problem. I guess it has to do with the falsity of the null-support thesis, i.e., that there is a "distinction [...] to draw between a theory that has independently predicted an observed effect and one that has been deliberately constructed to yield the effect as a consequence". But then again, it took Howson to point out that this might be false.

Also, I think you missed the definition of benign overfitting.

The evidence against adaptive overfitting is multifaceted.

- There's an absurd amount of empirical evidence against it. (Start with our paper reproducing CIFAR10 and Imagenet, then check out the follow-up. It's a robust finding!)

- Hoeffding's inequality and the union bound are absurdly conservative. (A worked numerical sketch follows this list.)

- Blum and Hardt's Ladder Algorithm gives a "natural algorithm" perspective on how the holdout set works in practice. (A minimal sketch of the mechanism also appears after this list.)

- If models make similar predictions, then you aren't learning adaptively when you check them.

- Moritz Hardt details other possibilities in his ICLR keynote: https://iclr.cc/virtual/2024/invited-talk/21799
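
To make the second bullet concrete, here is the standard worked bound with illustrative numbers of my own: for k models evaluated on a holdout of n examples, Hoeffding plus a union bound says that, with probability at least 1 − δ, every empirical accuracy lies within sqrt(ln(2k/δ)/(2n)) of its true value. Since the guarantee is uniform over the k models, it also covers choosing among them adaptively, and even this worst-case accounting leaves only a few percentage points of slack at realistic benchmark sizes.

```python
import math

def uniform_deviation(n, k, delta=0.05):
    """Hoeffding + union bound: with probability >= 1 - delta, all k models'
    holdout accuracies lie within this margin of their true accuracies."""
    return math.sqrt(math.log(2 * k / delta) / (2 * n))

# Illustrative numbers (mine): a 10,000-example holdout and up to a million
# candidate models still only allow about three percentage points of slack.
for k in (1, 1_000, 1_000_000):
    print(f"k = {k:>9,} models: margin = {uniform_deviation(10_000, k):.3f}")
```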
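
And here is a minimal sketch of the Ladder mechanism mentioned in the third bullet (my paraphrase of Blum and Hardt, with an arbitrary step size): the leaderboard score only changes when a submission beats the best reported score by a margin, which throttles how much information leaks from the holdout set back to the submitter.

```python
def ladder(holdout_losses, step=0.01):
    """Sketch of the Ladder leaderboard: report a new score only when a
    submission improves on the best-so-far by at least `step`, rounded to a
    multiple of `step`; otherwise repeat the previously reported score."""
    best, reported = float("inf"), []
    for loss in holdout_losses:  # empirical holdout loss of each submission
        if loss < best - step:
            best = round(loss / step) * step
        reported.append(best)
    return reported

# ladder([0.30, 0.295, 0.27, 0.272]) reports [0.30, 0.30, 0.27, 0.27]:
# the 0.295 and 0.272 submissions learn nothing new from the holdout.
```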

There's no simple answer, but there is also no evidence in favor of the existence of adaptive overfitting outside of toy models.

What's your definition of benign overfitting? The one by Bartlett et al. (2019) involves the limit of the rank of covariance matrices in a rather contrived generative model. It's not operational. And then there's this paper by Mallinar et al. (2022) that defines it as "certain methods that perfectly fit the training data still approach Bayes-optimal generalization." I mean, this describes nearest neighbors. So I'm not sure that's helpful either.

I did read your papers on CIFAR and Imagenet, and I don't doubt the empirical evidence. But I was looking for a theoretical explanation of why adaptive overfitting should not be a problem. It somehow goes against the deeply ingrained intuition that if there is a shortcut, it is going to be taken. It seems hard to completely rule out leakage across successive iterations on benchmarks on theoretical grounds alone. I wasn't aware of Moritz's talk, so thanks for linking. I'll definitely look into it.

As for benign overfitting: I don't have a definition, but just noticed that it was a term tossed around in the ML literature. I think in general people mean something like perfectly fitting the training data while still generalizing well (as in Mallinar). But why they call it overfitting escapes me.

Yes, exactly. We have known since the 1960s that both linear classifiers (i.e., the Perceptron) and Nearest Neighbors can perfectly fit the training data and still generalize.
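
Here is a quick way to see the nearest-neighbor half of that claim (a minimal scikit-learn sketch of my own): 1-NN fits its training set essentially perfectly by construction, yet its test accuracy on a simple benchmark is far above chance.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1-nearest-neighbor interpolates: each training point is its own neighbor.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))  # ~1.0: it memorizes the training set
print("test accuracy: ", clf.score(X_te, y_te))  # far above the 10% chance level
```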

And I'm with you: I wish there was a clear argument for why we should expect to see the "on-the-line" behavior we observed in the Imagenet/CIFAR10 papers and follow-ups. It's a cop-out for me to say "it's multifaceted." But, sometimes the world refuses to give us simple explanations.

As a machine learning engineer and sometimes researcher, I've seen "overfitting" used in two other contexts:

1. When you fit a model using gradient descent or similar hill-climbing methods, you may observe the loss on the training set continue to fall with each update, while the loss on a holdout validation set stops falling or even rises (which presumably indicates that the loss on the test set would also rise). At this point, the *model fitting procedure* (rather than the model itself) is said to have overfit. You can rescue this by simply taking the model checkpoint with the lowest validation set loss rather than the lowest training set loss (a minimal sketch follows this list). As you say, this has become less of an issue with more refined model fitting procedures.

2. When you test a model with inputs from a region where there were sparse training examples, the model outputs may have very little variation. For example, if you only have one example of a dog in the training set when building a text-to-image generation model, all the dogs generated by the model will not only resemble that dog, but may in fact be (almost) *exactly that dog image*, even if the input is "jumping dog", "yellow dog", "upside down dog", etc. The model is then said to have overfit on that training example. This failure mode is again becoming less of an issue with more diverse training sets and synthetic data to address data scarcity.
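
Here is a framework-agnostic sketch of the rescue described in item 1. `train_one_epoch` and `validation_loss` are hypothetical callables the caller would supply, not any particular library's API; the point is only that you keep the parameters from the epoch with the lowest validation loss.

```python
import copy

def fit_with_best_checkpoint(model, train_one_epoch, validation_loss,
                             max_epochs=100, patience=10):
    """Train until the validation loss stops improving and return the best
    checkpoint. `train_one_epoch(model)` updates the model in place;
    `validation_loss(model)` scores it on a holdout validation set."""
    best_loss, best_state, since_best = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss, best_state, since_best = val_loss, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:  # validation loss has stopped falling
                break
    return best_state  # checkpoint with the lowest validation loss, not the last one
```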

I think that adaptive overfitting *mostly* does not exist. It exists only for test sets with very low sample size or, more importantly, low signal in the predictors. Consider training many ML models on stock market data and then scoring them on a large holdout sample. The best one will have over 50 percent accuracy (at detecting the price going up) on the holdout sample. Yet very likely that is only overfitting, and the accuracy will not hold up on another sample.

This is not a problem when a strong signal exists. But when it does not exist, adaptive overfitting is an issue.
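
A toy simulation of exactly that scenario (numbers of my own choosing): a thousand pure-noise "strategies" scored on a 500-day holdout. The winner on the holdout looks better than chance purely through selection, then reverts to roughly 50% on a fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_holdout, n_fresh = 1_000, 500, 500

# No signal anywhere: market direction and every strategy's calls are coin flips,
# so each strategy's true accuracy is exactly 50%.
holdout_truth = rng.integers(0, 2, n_holdout)
fresh_truth = rng.integers(0, 2, n_fresh)
holdout_preds = rng.integers(0, 2, (n_strategies, n_holdout))
fresh_preds = rng.integers(0, 2, (n_strategies, n_fresh))

holdout_acc = (holdout_preds == holdout_truth).mean(axis=1)
best = holdout_acc.argmax()  # pick the winner on the holdout sample

print(f"winner's holdout accuracy: {holdout_acc[best]:.3f}")  # above 0.5 by selection alone
print(f"winner's fresh accuracy:   {(fresh_preds[best] == fresh_truth).mean():.3f}")  # ~0.5
```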

maybe "overfitting" makes the most sense when there is an adversary who is trying hard to report low error.

if you believe in this adversary (who is sometimes real, like someone marketing a backtested trading strategy, or trying to publish dubious applied ML papers), then a lot of the statistical theory makes more sense. The worst-case analysis is actually relevant in such situations.

outside of situations like that, trying to lower the training error usually helps, but implicitly there are a lot of really stupid ways of doing that, which are avoided by ML practitioners, even if they are willing to have a lot of parameters.

> "Regardless of what you call it, the Soviet Tank Problem is a data curation problem. The machine learning worked as intended."

As stated, that's a false dichotomy. It can be argued that the "testing distribution" not being representative is a data curation issue, and I agree with this argument. However, _robustness_ to "testing distribution" "shifts" (using your scare quotes) is a machine learning problem (IMO). As I see it, the responsibility lies with data curation to minimize such "shifts"; however, the responsibility lies with machine learning to maximize _robustness_ in the presence of such "shifts". For the moment, I am ignoring feasibility.

Considering, for a moment, the topic of whether machine learning could feasibly address this, it's clear that it cannot do so in absolute terms. I.e., the notion of a machine learning algorithm that is entirely immune to any distributional "shifts" is vacuous. In qualified terms, however, I do believe it may be feasible. That being said, it is ill-posed without first deciding on a way to measure such "shifts". The latter, unfortunately, is beyond the scope of this comment, but I hope that I have convinced you that this is an area where machine learning research could not just have something to say, but also something valuable to contribute.

I would put the Soviet tank example in the “noob mistake” category as well. If we sample from distribution D (which contains pictures of tanks on rainy days) to train our model, why should we expect it to work well when we test it on distribution D’ (which contains pictures of tanks on rainy and sunny days)?

The classic example for this is the 1948 presidential election (https://mathcenter.oxford.emory.edu/site/math117/historicalBlunders/). If we only poll people with a telephone in 1948, we are sampling from a specific part of the population and our results don't generalize to the distribution of the whole population.

One might say this is different from the tank example because, in that case, there is enough information in the training data to learn the intended task. However, the issue is that when we are optimizing, we are trying to do as well as we can on the training distribution. So if the model is to be used on another distribution, we should do something to mitigate the "overfitting" issue, e.g., apply regularization such as data augmentation. That essentially means doing something so that our samples look like they came from D' instead of D.
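
One hedged illustration of that last point, using torchvision's ColorJitter (the jitter ranges are arbitrary choices of mine): augment color during training so that, say, the same object photographed on green or brown grass still looks like a plausible training sample.

```python
from torchvision import transforms

# Color augmentation nudges training samples toward the wider distribution D':
# hue/saturation jitter makes green and brown grass equally plausible backgrounds.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, saturation=0.5, hue=0.1),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```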

You can argue the Soviet Tank example is a noob mistake, except the story is apocryphal.

On the other hand, the dermatology, radiology, and face recognition examples I gave of the Soviet Tank Problem are not only real but were accompanied by New York Times articles declaring AI would put doctors out of business or could be deployed by authoritarian regimes to oppress LGBTQ+ people.

You can call New York Times reporters gullible rubes, but the Soviet Tank Problem is a real problem that continues to happen and continues to mislead people about the power of machine learning.

I totally agree that people make "noob mistakes" every day in data analysis and machine learning. Just pick a random paper from your favorite ML conference proceedings.

The other examples, including dermatology, radiology, and face recognition, are all in the same category. Whenever somebody makes a claim based on data, the first reaction should be to check the data and the code to see whether things were done properly, not to write NYT articles, but unfortunately, most of the time the latter happens. Most often, people who make these claims or write these papers have not even taken a proper probability or statistics class, let alone a learning theory class, and they expect unreasonable things from machine learning models. It's like handing somebody heavy machinery and letting them operate it without reading the manual or going through proper training.

Although we are calling these "noob mistakes" in this discussion, there are cases where we might actually believe that the testing and training data come from the same distribution, later realize that we were wrong, and then maybe fix that with "regularization" or by gathering more data. But I still believe that if the distributions are the same and we are doing unreasonably well on the training data but not on the rest of the distribution, we can just call that "overfitting".
