This article comes at the perfect time. Leamer's metaphor is truly spot-on for ML. I wonder if these 'sins' aren't often just pragmatic solutions pushing theory forward.
This is an excellent post! I know you've been writing about these points for a few years, but I thought you did a particularly clear and convincing job with this writeup. There is so much for the ML community to think about here! Thank you
Thanks Tom!
Thought: this seems like the same type of issue that requires instrument calibration, integral feedback control, or channel estimation / adaptive filtering, depending on context, to get acceptable performance in the real world. Is it really that mysterious?
I mean, I certainly agree that it makes sense to get more data and retrain. But there are some weird things in these scatterplots that I still don't understand as well as I'd like to.
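To make the adaptive-filtering analogy in the comment above concrete, here is a minimal LMS sketch (my own toy, with made-up drift and noise levels, nothing from the post): the "channel" drifts slowly, and the filter keeps re-estimating it from fresh samples, which is roughly the "get more data and retrain" prescription.

```python
# Minimal LMS adaptive filter tracking a slowly drifting channel (toy example).
import numpy as np

rng = np.random.default_rng(2)

T, d = 5_000, 4          # time steps and filter length (hypothetical)
mu = 0.01                # LMS step size
w_hat = np.zeros(d)      # current estimate of the channel taps
w_true = rng.standard_normal(d)

errors = []
for t in range(T):
    # Slow drift in the true channel: the thing we keep having to re-estimate.
    w_true += 0.001 * rng.standard_normal(d)
    x = rng.standard_normal(d)                      # fresh input sample
    y = w_true @ x + 0.05 * rng.standard_normal()   # observed output
    e = y - w_hat @ x                               # prediction error on new data
    w_hat += mu * e * x                             # LMS update: adapt to the drift
    errors.append(e * e)

print(f"mean squared error, first 10%: {np.mean(errors[:T//10]):.3f}")
print(f"mean squared error, last 10% : {np.mean(errors[-T//10:]):.3f}")
```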
I don't think the phrase "Training on the test set does not lead to overfitting" correctly describes what you're trying to say. If you literally trained on the test set, you'd get close to 100% accuracy on the test set, but you wouldn't get anywhere near that accuracy on fresh IID data. By definition, that would be overfitting.
I think a better way of describing it would be the phrasing you used in your paper: "This shows that the current research methodology of “attacking” a test set for an extended period of time is surprisingly resilient to overfitting."
I agree, but "training on the test set" has been the term of art since at least the 1960s.
https://www.argmin.net/p/benchmarking-our-benchmarks
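For anyone who wants to poke at the "attacking the test set" claim, here is a toy simulation (my own; the accuracy, test-set size, and model count are made up): score many models with identical true accuracy on one shared test set, keep the best one, then re-check it on fresh IID data, so that any gap is pure selection noise.

```python
# Toy simulation of adaptively reusing a fixed test set.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000      # test set size (hypothetical)
k = 1_000       # number of models evaluated against the same test set
p_true = 0.75   # every model's true (population) accuracy

# Test-set accuracy of each model: binomial noise around the true accuracy.
test_acc = rng.binomial(n, p_true, size=k) / n

# Adaptive selection: keep the model that looks best on the shared test set.
best = np.argmax(test_acc)

# Fresh IID data: the selected model's accuracy reverts toward its true value.
fresh_acc = rng.binomial(n, p_true) / n

print(f"best test-set accuracy : {test_acc[best]:.4f}")
print(f"fresh-sample accuracy  : {fresh_acc:.4f}")
print(f"adaptivity gap         : {test_acc[best] - fresh_acc:.4f}")
# The inflation from selection is on the order of sqrt(2*log(k)/n) times the
# per-example standard deviation, so it stays small for benchmark-sized n.
```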
Has it ever made sense to train on the training data? From a practical perspective, the data you test on will never ever be a random subset of the data you collected to train on. You only operate on data you collect after the model is trained. All training data is "pretraining" data!
I noticed in some of your class slides that in the original NIST digits data the test and training sets came from different populations. Why isn't it standard practice to do that anymore? It seems to me we want to see at least some basic level of generalizability.
Yes, and in the last few years it has become "best practice" or at least one to aspire to. Most LLM work, for example, seems to be solidly in this direction.
The issue is just that this requires larger models and larger data sets, which necessitates an arms race in computation and puts good practice out of the reach of most teams.
This is why I think some of the most pressing problems facing non-industrial machine learning research are facilitating open data sets and machine-learning systems optimization.
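To make the "different populations" point from the NIST comment above concrete, here is a toy sketch (hypothetical data; scikit-learn's GroupShuffleSplit is just one way to do it) of holding out entire writers for testing rather than splitting individual examples at random.

```python
# Group-based split: whole writers go to either train or test, never both.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(3)

# Fake data: 1,000 examples produced by 50 different writers.
X = rng.standard_normal((1000, 16))
writers = rng.integers(0, 50, size=1000)
y = rng.integers(0, 10, size=1000)

# A random IID split would mix every writer into both sets; a group split
# keeps each writer entirely on one side, giving a mild population shift.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=writers))

assert set(writers[train_idx]).isdisjoint(set(writers[test_idx]))
print(f"train writers: {len(set(writers[train_idx]))}, "
      f"test writers: {len(set(writers[test_idx]))}")
```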
I really like the ImageNet generalization paper, of course. I think it's interesting that the models have their ranks preserved from S to S', i.e., the adaptivity gap is small. I think the discussion in the Sec. 5.1 "Limited Model Class" paragraph makes sense. But to the extent that you can train the *same architecture* with different random seeds and get different accuracies, and then pick the best random seed based on S, would those gains also generalize to S'?
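One way to make that question concrete is a toy simulation (mine, with made-up accuracies and test-set sizes): each seed gets a slightly different true accuracy plus measurement noise on S, and we check how much of the best-seed gain survives on S'.

```python
# Toy sketch of picking the best random seed on S and evaluating it on S'.
import numpy as np

rng = np.random.default_rng(1)

n_S, n_Sprime = 50_000, 10_000   # sizes of S and S' (hypothetical)
m_seeds = 20                     # number of seeds of the same architecture
base_acc = 0.760                 # architecture's mean true accuracy (made up)
seed_spread = 0.003              # std. dev. of seed-to-seed true accuracy

# True accuracy of each seed, then its measured accuracy on S.
true_acc = base_acc + seed_spread * rng.standard_normal(m_seeds)
acc_on_S = rng.binomial(n_S, true_acc) / n_S

# Select the best seed using S, then evaluate that same seed on S'.
best = np.argmax(acc_on_S)
acc_on_Sprime = rng.binomial(n_Sprime, true_acc[best]) / n_Sprime

print(f"gain on S over the average seed  : {acc_on_S[best] - acc_on_S.mean():.4f}")
print(f"gain on S' over the average seed : {acc_on_Sprime - true_acc.mean():.4f}")
# On average, only the selected seed's true advantage carries over to S';
# the rest of the apparent gain on S is selection noise.
```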
In the ImageNet replication experiment, was a validation set used to stop training before overfitting? If so, maybe the validation set was too small and the models are undertrained? That'd explain why they benefit from further training.
In our replication experiment, we downloaded pre-trained ImageNet models and evaluated them. We did not train any models ourselves when making that plot.