This article comes at the perfect time. Leamer's metaphor is truly spot-on for ML. I wonder if these 'sins' aren't often just pragmatic solutions pushing theory forward.
This is an excellent post! I know you've been writing about these points for a few years, but I thought you did a particularly clear and convincing job with this writeup. There is so much for the ML community to think about here! Thank you
Thanks Tom!
Thought: this seems like the same type of issue that requires instrument calibration, integral feedback control, or channel estimation / adaptive filtering, depending on context, to get acceptable performance in the real world. Is it really that mysterious?
I mean, I certainly agree that it makes sense to get more data and retrain. But there are some weird things in these scatterplots that I still don't understand as well as I'd like to.
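To make the adaptive-filtering analogy in the comment above concrete, here is a minimal LMS sketch (my own toy, with made-up drift and noise levels, nothing from the post): the "channel" drifts slowly, and the filter keeps re-estimating it from fresh samples, which is roughly the "get more data and retrain" prescription.

```python
# Minimal LMS adaptive filter tracking a slowly drifting channel (toy example).
import numpy as np

rng = np.random.default_rng(2)

T, d = 5_000, 4          # time steps and filter length (hypothetical)
mu = 0.01                # LMS step size
w_hat = np.zeros(d)      # current estimate of the channel taps
w_true = rng.standard_normal(d)

errors = []
for t in range(T):
    # Slow drift in the true channel: the thing we keep having to re-estimate.
    w_true += 0.001 * rng.standard_normal(d)
    x = rng.standard_normal(d)                      # fresh input sample
    y = w_true @ x + 0.05 * rng.standard_normal()   # observed output
    e = y - w_hat @ x                               # prediction error on new data
    w_hat += mu * e * x                             # LMS update: adapt to the drift
    errors.append(e * e)

print(f"mean squared error, first 10%: {np.mean(errors[:T//10]):.3f}")
print(f"mean squared error, last 10% : {np.mean(errors[-T//10:]):.3f}")
```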
I don't think the phrase "Training on the test set does not lead to overfitting" correctly describes what you're trying to say. If you literally trained on the test set, you'd get close to 100% accuracy on the test set, but you wouldn't get anywhere near that accuracy on fresh IID data. By definition, that would be overfitting.
I think a better way of describing it would be the phrasing you used in your paper: "This shows that the current research methodology of “attacking” a test set for an extended period of time is surprisingly resilient to overfitting."
I agree, but "training on the test set" has been the term of art since at least the 1960s.
https://www.argmin.net/p/benchmarking-our-benchmarks
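For anyone who wants to poke at the "attacking the test set" claim, here is a toy simulation (my own; the accuracy, test-set size, and model count are made up): score many models with identical true accuracy on one shared test set, keep the best one, then re-check it on fresh IID data, so that any gap is pure selection noise.

```python
# Toy simulation of adaptively reusing a fixed test set.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000      # test set size (hypothetical)
k = 1_000       # number of models evaluated against the same test set
p_true = 0.75   # every model's true (population) accuracy

# Test-set accuracy of each model: binomial noise around the true accuracy.
test_acc = rng.binomial(n, p_true, size=k) / n

# Adaptive selection: keep the model that looks best on the shared test set.
best = np.argmax(test_acc)

# Fresh IID data: the selected model's accuracy reverts toward its true value.
fresh_acc = rng.binomial(n, p_true) / n

print(f"best test-set accuracy : {test_acc[best]:.4f}")
print(f"fresh-sample accuracy  : {fresh_acc:.4f}")
print(f"adaptivity gap         : {test_acc[best] - fresh_acc:.4f}")
# The inflation from selection is on the order of sqrt(2*log(k)/n) times the
# per-example standard deviation, so it stays small for benchmark-sized n.
```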
Has it ever made sense to train on the training data? From a practical perspective, the data you test on will never ever be a random subset of the data you collected to train on. You only operate on data you collect after the model is trained. All training data is "pretraining" data!
I noticed in some of your class slides that in the original NIST digits data the test and training sets came from different populations. Why isn't it standard practice to do that anymore? It seems to me we want to see at least some basic level of generalizability.
Yes, and in the last few years it has become "best practice" or at least one to aspire to. Most LLM work, for example, seems to be solidly in this direction.
The issue is just that this requires larger models and larger data sets, which necessitates an arms race in computation and puts good practice out of the reach of most teams.
This is why I think some of the most pressing problems facing non-industrial machine learning research are facilitating open data sets and machine-learning systems optimization.
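To make the "different populations" point from the NIST comment above concrete, here is a toy sketch (hypothetical data; scikit-learn's GroupShuffleSplit is just one way to do it) of holding out entire writers for testing rather than splitting individual examples at random.

```python
# Group-based split: whole writers go to either train or test, never both.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(3)

# Fake data: 1,000 examples produced by 50 different writers.
X = rng.standard_normal((1000, 16))
writers = rng.integers(0, 50, size=1000)
y = rng.integers(0, 10, size=1000)

# A random IID split would mix every writer into both sets; a group split
# keeps each writer entirely on one side, giving a mild population shift.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=writers))

assert set(writers[train_idx]).isdisjoint(set(writers[test_idx]))
print(f"train writers: {len(set(writers[train_idx]))}, "
      f"test writers: {len(set(writers[test_idx]))}")
```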
I really like the ImageNet generalization paper, of course. I think it's interesting that the models have their ranks preserved from S to S', i.e., the adaptivity gap is small. I think the discussion in the Sec. 5.1 "Limited Model Class" paragraph makes sense. But to the extent that you can train the *same architecture* with different random seeds and get different accuracies, and then pick the best random seed based on S, would those gains also generalize to S'?
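One way to make that question concrete is a toy simulation (mine, with made-up accuracies and test-set sizes): each seed gets a slightly different true accuracy plus measurement noise on S, and we check how much of the best-seed gain survives on S'.

```python
# Toy sketch of picking the best random seed on S and evaluating it on S'.
import numpy as np

rng = np.random.default_rng(1)

n_S, n_Sprime = 50_000, 10_000   # sizes of S and S' (hypothetical)
m_seeds = 20                     # number of seeds of the same architecture
base_acc = 0.760                 # architecture's mean true accuracy (made up)
seed_spread = 0.003              # std. dev. of seed-to-seed true accuracy

# True accuracy of each seed, then its measured accuracy on S.
true_acc = base_acc + seed_spread * rng.standard_normal(m_seeds)
acc_on_S = rng.binomial(n_S, true_acc) / n_S

# Select the best seed using S, then evaluate that same seed on S'.
best = np.argmax(acc_on_S)
acc_on_Sprime = rng.binomial(n_Sprime, true_acc[best]) / n_Sprime

print(f"gain on S over the average seed  : {acc_on_S[best] - acc_on_S.mean():.4f}")
print(f"gain on S' over the average seed : {acc_on_Sprime - true_acc.mean():.4f}")
# On average, only the selected seed's true advantage carries over to S';
# the rest of the apparent gain on S is selection noise.
```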
In the ImageNet replication experiment, was a validation set used to stop training before overfitting? If so, maybe the validation set was too small and the models are undertrained? That'd explain why they benefit from further training.
In our replication experiment, we downloaded pre-trained ImageNet models and evaluated them. We did not train any models ourselves when making that plot.