I'm reminded of an incisive argument in Kate Crawford's *Atlas of AI* about classification: What does it mean to classify a person as a certain race? Or to say someone's expression indicates "happiness"? These are loaded categories.
Yes, and I’d take it a step further: *all* categories are loaded.
Relevant: https://issues.org/limits-of-data-nguyen/
I think I love your blog. By the way, what construct is training set accuracy tied to? Clearly not the same one as test set accuracy; otherwise we would have no need for a test set.
Yes, this is a great question! We really need to start there.
Extrapolating from your graph, could you conclude that no matter how hard you optimize your models on ImageNet, they will never generalize to ImageNetV2 or ObjectNet?
With ImageNet training data alone, yes, I think it will be impossible to get perfect accuracy on these other benchmarks, no matter how good the ImageNet test error. However, I am not ruling out that other training modalities that use additional data sources could get high accuracy on all three data sets.
As Vaishaal Shankar put it: if you want better test set accuracy, don't train on the training set.
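A minimal sketch of what this extrapolation looks like, using made-up accuracy pairs rather than the actual numbers behind the post's figure: fit the linear trend relating ImageNet and ImageNetV2 accuracy, then evaluate it at 100% ImageNet accuracy. If the fitted line sits below the diagonal there, the extrapolated ImageNetV2 accuracy stays short of perfect.

```python
import numpy as np

# Hypothetical (ImageNet, ImageNetV2) accuracy pairs -- illustrative only,
# not the data behind the post's figure.
imagenet = np.array([0.70, 0.75, 0.80, 0.85])
imagenet_v2 = np.array([0.57, 0.63, 0.69, 0.75])

# Fit the linear trend and extrapolate to a perfect ImageNet score.
slope, intercept = np.polyfit(imagenet, imagenet_v2, 1)
predicted_v2 = slope * 1.0 + intercept
print(f"trend: v2 = {slope:.2f} * imagenet + {intercept:.2f}")
print(f"extrapolated ImageNetV2 accuracy at 100% ImageNet accuracy: {predicted_v2:.2f}")
```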
Q1: Could it be that construct validity issues simply manifest as measurement noise that is small enough in aggregate? (A small simulation of this framing appears after this comment.)
Q2: What if we considered problems where subjectivity in the labels is low, like MNIST? Maybe clean it even further, like so: https://cleanlab.ai/blog/label-errors-image-datasets/ . If not that, maybe we can take short-horizon forecasting problems with some leading indicators as features and regress on some measurable quantity (precipitation, temperature), or classify those into ordinal (sometimes even binary) classes?
I ask because I'd love to understand the surprising robustness of test sets without invoking construct validity (I acknowledge that annotation is hard and subjective in general, and that this is not well appreciated), unless that really is an important explainer for this mystery. Another possible distraction is the connection to causality: what if we humbly admit that the goal of prediction is merely to interpolate missing data, and we exercise good judgment (rare in practice) to avoid making sweeping causal conclusions?
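A minimal Monte Carlo sketch of the framing in Q1, under the (strong) assumption that construct problems behave like symmetric label noise that is independent of the model: the noise does not average away, but it shifts measured accuracy by a roughly constant factor, which would leave comparisons between models largely intact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 10                    # test set size, number of classes
true_acc, noise_rate = 0.90, 0.05    # assumed model accuracy and label-noise rate

true_labels = rng.integers(0, k, size=n)

# Model predictions: correct w.p. true_acc, otherwise a random wrong class.
correct = rng.random(n) < true_acc
wrong = (true_labels + rng.integers(1, k, size=n)) % k
preds = np.where(correct, true_labels, wrong)

# Recorded test labels: flipped to a random wrong class w.p. noise_rate.
flipped = rng.random(n) < noise_rate
flip_to = (true_labels + rng.integers(1, k, size=n)) % k
recorded = np.where(flipped, flip_to, true_labels)

# Measured accuracy lands close to true_acc * (1 - noise_rate): a small,
# systematic offset rather than noise that washes out.
measured_acc = (preds == recorded).mean()
print(f"true accuracy {true_acc:.2f} -> measured accuracy {measured_acc:.3f}")
```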
Yes, this is a great question: what are the best data sets that are representative of our actual ML experience but not laden with complex construct validity issues? What would be a good “synthetic data set”?
The issue with synthetic datasets is that if you can write f(x) in a few lines of code, then it’s not a real machine learning problem.
https://www.argmin.net/p/the-war-of-symbolic-aggression
Maybe MNIST is the right starting point!
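A toy illustration of that point about writing f(x) in a few lines of code, with an entirely hypothetical setup: when the label-generating function really is a short program, a held-out test set can only certify that the model rediscovered that program, so there is no external construct left for the labels to be valid about.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "construct" is literally a few lines of code: a fixed linear rule.
w = rng.normal(size=20)

def f(x):
    return (x @ w > 0).astype(int)

X_train, X_test = rng.normal(size=(5000, 20)), rng.normal(size=(1000, 20))
y_train, y_test = f(X_train), f(X_test)

# Plain least squares on +/-1 targets, thresholded at zero.
w_hat, *_ = np.linalg.lstsq(X_train, 2.0 * y_train - 1.0, rcond=None)
test_acc = ((X_test @ w_hat > 0).astype(int) == y_test).mean()
print(f"test accuracy: {test_acc:.3f}")  # close to perfect; nothing left to measure
```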
"The issue with synthetic datasets is that if you can write f(x) in a few lines of code, then it’s not a real machine learning problem."
I think there is some subtlety glossed over here. If you believe that there are constructs, then these constructs cannot be defined implicitly via test sets. There must be some intensional (perhaps even computable) definition of these constructs, although I'm not sure if it must be short.
Yes indeed. I wrote about this more today; I'm still working through the subtleties. https://www.argmin.net/p/nomological-networks
This post of yours is now my new favorite! Simply by asking thought-provoking questions, you have demonstrated how pervasive "construct validity" concerns are - it is quite unsettling that we are unable to find great datasets. This is why the post is going to hit the right note with a super diverse audience.
I also take back MNIST being a great example because it only addresses part of the problem in the following decomposition:
(1) Subjectivity of labels: Would two independent annotators agree on the labels? There is some sketchy theory on inter-rater reliability (IRR), but really this is also a definition, or construct validity, problem (a minimal kappa sketch appears below). With MNIST, this subjectivity is very low, since humans agree on what each digit is, and if some examples are indistinguishably bad, we can just eliminate them - they are useless.
(2) Coverage of labels: Take a claim like "I have solved vision." Substantiating this would be very hard: scaling to so many categories is hard, there isn't even a simple categorization, and it is not clear what the right ontology is for expressing the concepts we mean when we translate what we see into examples. Multimodal training seems promising, but it has a ways to go. Ultimately, how will we test such claims unless we curate large, high-quality test sets, which are the real drivers of progress? In MNIST, by contrast, this is settled: there are only 10 digits!
(3) Coverage of conditions for labels: THIS is the hardest of them all. My favorite example is from https://leon.bottou.org/talks/2challenges (ICML 2015 talk), page 56, where Léon shows a picture of a car in a swimming pool and asks whether it is any less of a car because the "context" is wrong. I believe it is still possible to create MNIST examples that degrade performance in this way, because our mental specification of what digits are may not be satisfied by the finite set of examples we provide.
Both (1) & (3) are construct validity questions that arise from our inability to precisely define what we mean.
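To make the inter-rater reliability point in (1) concrete, here is a minimal sketch of Cohen's kappa, the standard chance-corrected agreement statistic, computed for two hypothetical annotators with made-up labels.

```python
import numpy as np

def cohens_kappa(a, b, k):
    """Chance-corrected agreement between two annotators over k classes."""
    a, b = np.asarray(a), np.asarray(b)
    p_observed = (a == b).mean()
    # Chance agreement: product of the two annotators' marginal label frequencies.
    p_chance = sum((a == c).mean() * (b == c).mean() for c in range(k))
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical annotations of ten images by two annotators (classes 0/1/2).
ann1 = [0, 0, 1, 2, 2, 1, 0, 2, 1, 1]
ann2 = [0, 0, 1, 2, 1, 1, 0, 2, 2, 1]
print(f"kappa = {cohens_kappa(ann1, ann2, k=3):.2f}")  # about 0.70 for these labels
```

On MNIST-style digit labels kappa would sit near 1; the more loaded the category, the further it drops.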
Perhaps it is a fool's errand to pretend that we can overcome our inability to precisely define what we mean ("align" to human preferences) by just creating more examples. There is a lot more to unpack about this post and follow-ups to ask, but I will do that in separate comments.
Much gratitude from your eternal student for your regular blog posts - pretty much the ML equivalent of binge-worthy Netflix shows!
Great post. Related to the discussion in computer vision about "What constitutes a category?" (and, of course, prior philosophical investigation of this question). I wonder how much it would help to model the data-generating process (or at least the label-generating process) during model fitting? One of my all-time favorite papers is The Multidimensional Wisdom of Crowds (https://papers.nips.cc/paper_files/paper/2010/hash/0f9cafd014db7a619ddb4276af0d692c-Abstract.html) where they introduce a latent variable model to account for such phenomena as (a) how the labeler interprets the category definitions, (b) what aspects of the image the labeler is attending to, and (c) whether the labeler is cooperative or adversarial.
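For a feel of why modeling the label-generating process can help, here is a drastically simplified toy simulation in the spirit of such latent variable models; it is not the paper's model (which infers image- and annotator-specific latent variables), and the annotator reliabilities here are treated as known rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_annotators = 200, 5

# Toy generative story: each annotator has a competence and may be adversarial.
true_labels = rng.integers(0, 2, size=n_items)
competence = rng.uniform(0.6, 0.95, size=n_annotators)
adversarial = np.array([False, False, False, False, True])

labels = np.empty((n_items, n_annotators), dtype=int)
for j in range(n_annotators):
    correct = rng.random(n_items) < competence[j]
    obs = np.where(correct, true_labels, 1 - true_labels)
    labels[:, j] = 1 - obs if adversarial[j] else obs

# A naive majority vote ignores who produced each label...
majority = (labels.mean(axis=1) > 0.5).astype(int)

# ...while weighting each vote by annotator reliability (sign-flipped for the
# adversarial annotator) uses the structure that majority voting throws away.
weights = np.where(adversarial, -1.0, 1.0) * (2.0 * competence - 1.0)
weighted = ((2 * labels - 1) @ weights > 0).astype(int)

print("majority vote accuracy:       ", (majority == true_labels).mean())
print("reliability-weighted accuracy:", (weighted == true_labels).mean())
```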
I loved this post! I think this also speaks to why post-training LLMs to follow instructions is so hard.