I'm reminded of an incisive argument in Kate Crawford's *Atlas of AI* about classification: What does it mean to classify a person as a certain race? Or to say someone's expression indicates "happiness"? These are loaded categories.
Yes, and I’d take it a step further: *all* categories are loaded.
Relevant: https://issues.org/limits-of-data-nguyen/
I think I love your blog. By the way, what construct is training set accuracy tied to? Clearly not the same one as test set accuracy; otherwise we would have no need for a test set.
Yes, this is a great question! We really need to start there.
Extrapolating from your graph, could you conclude that no matter how hard you optimize your models on ImageNet, they will never generalize to ImageNetV2 or ObjectNet?
With ImageNet training data alone, yes, I think it will be impossible to get perfect accuracy on these other benchmarks, no matter how good the ImageNet test error. However, I am not ruling out that other training modalities that use additional data sources could get high accuracy on all three data sets.
As Vaishaal Shankar put it: if you want better test set accuracy, don't train on the training set.
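A minimal sketch of what this extrapolation looks like, using made-up accuracy pairs rather than the actual numbers behind the post's figure: fit the linear trend relating ImageNet and ImageNetV2 accuracy, then evaluate it at 100% ImageNet accuracy. If the fitted line sits below the diagonal there, the extrapolated ImageNetV2 accuracy stays short of perfect.

```python
import numpy as np

# Hypothetical (ImageNet, ImageNetV2) accuracy pairs -- illustrative only,
# not the data behind the post's figure.
imagenet = np.array([0.70, 0.75, 0.80, 0.85])
imagenet_v2 = np.array([0.57, 0.63, 0.69, 0.75])

# Fit the linear trend and extrapolate to a perfect ImageNet score.
slope, intercept = np.polyfit(imagenet, imagenet_v2, 1)
predicted_v2 = slope * 1.0 + intercept
print(f"trend: v2 = {slope:.2f} * imagenet + {intercept:.2f}")
print(f"extrapolated ImageNetV2 accuracy at 100% ImageNet accuracy: {predicted_v2:.2f}")
```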
Q1: Could it be that construct validity issues simply manifest as measurement noise that is small enough in aggregate? (A small simulation of this framing appears after this comment.)
Q2: What if we considered problems where subjectivity in the labels is low, like MNIST? Maybe clean it even further, like so: https://cleanlab.ai/blog/label-errors-image-datasets/ . If not that, maybe we can take short-horizon forecasting problems with some leading indicators as features and regress on some measurable quantity (precipitation, temperature), or classify those into ordinal (sometimes even binary) classes?
I ask because I'd love to understand the surprising robustness of test sets without invoking construct validity (I acknowledge that annotation is hard and subjective in general, and that this is not well appreciated), unless that really is an important explainer for this mystery. Another possible distraction is the connection to causality: what if we humbly admit that the goal of prediction is merely to interpolate missing data, and we exercise good judgment (rare in practice) to avoid making sweeping causal conclusions?
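A minimal Monte Carlo sketch of the framing in Q1, under the (strong) assumption that construct problems behave like symmetric label noise that is independent of the model: the noise does not average away, but it shifts measured accuracy by a roughly constant factor, which would leave comparisons between models largely intact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 10                    # test set size, number of classes
true_acc, noise_rate = 0.90, 0.05    # assumed model accuracy and label-noise rate

true_labels = rng.integers(0, k, size=n)

# Model predictions: correct w.p. true_acc, otherwise a random wrong class.
correct = rng.random(n) < true_acc
wrong = (true_labels + rng.integers(1, k, size=n)) % k
preds = np.where(correct, true_labels, wrong)

# Recorded test labels: flipped to a random wrong class w.p. noise_rate.
flipped = rng.random(n) < noise_rate
flip_to = (true_labels + rng.integers(1, k, size=n)) % k
recorded = np.where(flipped, flip_to, true_labels)

# Measured accuracy lands close to true_acc * (1 - noise_rate): a small,
# systematic offset rather than noise that washes out.
measured_acc = (preds == recorded).mean()
print(f"true accuracy {true_acc:.2f} -> measured accuracy {measured_acc:.3f}")
```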
Yes, this is a great question: what are the best data sets that are representative of our actual ML experience but not laden with complex construct validity issues? What would be a good “synthetic data set”?
The issue with synthetic datasets is that if you can write f(x) in a few lines of code, then it’s not a real machine learning problem.
https://www.argmin.net/p/the-war-of-symbolic-aggression
Maybe MNIST is the right starting point!
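A toy illustration of that point about writing f(x) in a few lines of code, with an entirely hypothetical setup: when the label-generating function really is a short program, a held-out test set can only certify that the model rediscovered that program, so there is no external construct left for the labels to be valid about.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "construct" is literally a few lines of code: a fixed linear rule.
w = rng.normal(size=20)

def f(x):
    return (x @ w > 0).astype(int)

X_train, X_test = rng.normal(size=(5000, 20)), rng.normal(size=(1000, 20))
y_train, y_test = f(X_train), f(X_test)

# Plain least squares on +/-1 targets, thresholded at zero.
w_hat, *_ = np.linalg.lstsq(X_train, 2.0 * y_train - 1.0, rcond=None)
test_acc = ((X_test @ w_hat > 0).astype(int) == y_test).mean()
print(f"test accuracy: {test_acc:.3f}")  # close to perfect; nothing left to measure
```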
"The issue with synthetic datasets is that if you can write f(x) in a few lines of code, then it’s not a real machine learning problem."
I think there is some subtlety glossed over here. If you believe that there are constructs, then these constructs cannot be defined implicitly via test sets. There must be some intensional (perhaps even computable) definition of these constructs, although I'm not sure if it must be short.
Yes indeed. I wrote about this more today; I'm still working through the subtleties. https://www.argmin.net/p/nomological-networks
This post of yours is now my new favorite! Simply by asking thought-provoking questions, you have demonstrated how pervasive "construct validity" concerns are - it is quite unsettling that we are unable to find great datasets. This is why the post is going to hit the right note with a super diverse audience.
I also take back MNIST being a great example because it only addresses part of the problem in the following decomposition:
(1) Subjectivity of labels: Would two independent annotators agree on the labels? There is some sketchy theory on inter-rater reliability (IRR), but really this is also a definition, or construct validity, problem (a minimal kappa sketch appears below). With MNIST, this subjectivity is very low, since humans agree on what each digit is, and if some examples are indistinguishably bad, we can just eliminate them - they are useless.
(2) Coverage of labels: Take a claim like "I have solved vision." Substantiating this would be very hard: scaling to so many categories is hard, there isn't even a simple categorization, and it is not clear what the right ontology is for expressing the concepts we mean when we translate what we see into examples. Multimodal training seems promising, but it has a ways to go. Ultimately, how will we test such claims unless we curate large, high-quality test sets, which are the real drivers of progress? In MNIST, by contrast, this is settled: there are only 10 digits!
(3) Coverage of conditions for labels: THIS is the hardest of them all. My favorite example is from https://leon.bottou.org/talks/2challenges (ICML 2015 talk), page 56, where Léon shows a picture of a car in a swimming pool and asks whether it is any less of a car because the "context" is wrong. I believe it is still possible to create MNIST examples that degrade performance in this way, because our mental specification of what digits are may not be satisfied by the finite set of examples we provide.
Both (1) & (3) are construct validity questions that arise from our inability to precisely define what we mean.
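To make the inter-rater reliability point in (1) concrete, here is a minimal sketch of Cohen's kappa, the standard chance-corrected agreement statistic, computed for two hypothetical annotators with made-up labels.

```python
import numpy as np

def cohens_kappa(a, b, k):
    """Chance-corrected agreement between two annotators over k classes."""
    a, b = np.asarray(a), np.asarray(b)
    p_observed = (a == b).mean()
    # Chance agreement: product of the two annotators' marginal label frequencies.
    p_chance = sum((a == c).mean() * (b == c).mean() for c in range(k))
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical annotations of ten images by two annotators (classes 0/1/2).
ann1 = [0, 0, 1, 2, 2, 1, 0, 2, 1, 1]
ann2 = [0, 0, 1, 2, 1, 1, 0, 2, 2, 1]
print(f"kappa = {cohens_kappa(ann1, ann2, k=3):.2f}")  # about 0.70 for these labels
```

On MNIST-style digit labels kappa would sit near 1; the more loaded the category, the further it drops.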
Perhaps it is a fool's errand to pretend that we can overcome our inability to precisely define what we mean ("align" to human preferences) by just creating more examples. There is a lot more to unpack about this post and follow-ups to ask, but I will do that in separate comments.
Much gratitude from your eternal student for your regular blog posts - pretty much the ML equivalent of binge-worthy Netflix shows!
Great post. Related to the discussion in computer vision about "What constitutes a category?" (and, of course, prior philosophical investigation of this question). I wonder how much it would help to model the data-generating process (or at least the label-generating process) during model fitting? One of my all-time favorite papers is The Multidimensional Wisdom of Crowds (https://papers.nips.cc/paper_files/paper/2010/hash/0f9cafd014db7a619ddb4276af0d692c-Abstract.html) where they introduce a latent variable model to account for such phenomena as (a) how the labeler interprets the category definitions, (b) what aspects of the image the labeler is attending to, and (c) whether the labeler is cooperative or adversarial.
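For a feel of why modeling the label-generating process can help, here is a drastically simplified toy simulation in the spirit of such latent variable models; it is not the paper's model (which infers image- and annotator-specific latent variables), and the annotator reliabilities here are treated as known rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_annotators = 200, 5

# Toy generative story: each annotator has a competence and may be adversarial.
true_labels = rng.integers(0, 2, size=n_items)
competence = rng.uniform(0.6, 0.95, size=n_annotators)
adversarial = np.array([False, False, False, False, True])

labels = np.empty((n_items, n_annotators), dtype=int)
for j in range(n_annotators):
    correct = rng.random(n_items) < competence[j]
    obs = np.where(correct, true_labels, 1 - true_labels)
    labels[:, j] = 1 - obs if adversarial[j] else obs

# A naive majority vote ignores who produced each label...
majority = (labels.mean(axis=1) > 0.5).astype(int)

# ...while weighting each vote by annotator reliability (sign-flipped for the
# adversarial annotator) uses the structure that majority voting throws away.
weights = np.where(adversarial, -1.0, 1.0) * (2.0 * competence - 1.0)
weighted = ((2 * labels - 1) @ weights > 0).astype(int)

print("majority vote accuracy:       ", (majority == true_labels).mean())
print("reliability-weighted accuracy:", (weighted == true_labels).mean())
```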
I loved this post! I think this also speaks to why post-training LLMs to follow instructions is so hard.