I love the revised cifar plot and I think about it a lot.
One dumb phenomenological way of thinking about curves like that is to assume we can predict the pass rate of a model from two factors: an intrinsic model capability and an intrinsic problem difficulty. If you assume a dumb P(model solves problem) = sigmoid(capability - difficulty), and you approximate the sigmoid with a linear function, you get exactly this kind of behavior: looking at an ensemble of models on a fixed problem set you'll see a line, and switching problem sets gives lines with different slopes that all meet at 100% accuracy. This doesn't explain why the revised cifar is reliably harder than the original cifar, however. But it does explain behavior I've seen in LLMs, where, broadly speaking, a large number of unrelated benchmarks evaluated over many models has a PCA of surprisingly low dimension, so you're better off picking a small number of metrics to look at and ignoring the rest.
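To make that toy model concrete, here is a minimal simulation sketch (all capability and difficulty values are made up, not fit to any real leaderboard) of the sigmoid(capability - difficulty) story; regressing accuracy on the harder set against accuracy on the easier set comes out close to a line heading toward (100%, 100%).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one capability per model, and two problem sets
# whose difficulty distributions differ only by a shift.
capabilities = np.linspace(-2.0, 4.0, 30)
easy_set = rng.normal(loc=-1.0, scale=1.0, size=2000)   # "original" problems
hard_set = rng.normal(loc=0.0, scale=1.0, size=2000)    # "revised" problems

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def accuracy(capability, difficulties):
    # P(model solves problem) = sigmoid(capability - difficulty),
    # averaged over the problem set.
    return sigmoid(capability - difficulties).mean()

acc_easy = np.array([accuracy(c, easy_set) for c in capabilities])
acc_hard = np.array([accuracy(c, hard_set) for c in capabilities])

# A linear fit of hard-set accuracy against easy-set accuracy is a good
# approximation away from the tails, and both approach 1.0 together.
slope, intercept = np.polyfit(acc_easy, acc_hard, 1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```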
I think this is how IQ was first defined?
That said I still cannot explain why the new line is lower than the old line.
100%! Someone really needs to do this experiment. That is, building an Item Response Theory model for scatterplot data. That person should really be me. Shoot, I need to find a collaborator...
I bet this would explain a lot of what we're seeing. I don't know whether this was the methodology used for IQ, but it was the model ETS used to develop the modern standardized test.
If this model even predicts a linear relationship, I'd be satisfied.
We discuss a model like this in Section 5.3 of the ImageNetV2 paper (https://arxiv.org/pdf/1902.10811). While it also shows a linear trend, when I tried to fit that model to the actual ImageNet data the instance difficulty distribution wasn't Gaussian (e.g., there were more images correctly or incorrectly labeled by all models than the Gaussian assumption suggested). Maybe there are other item response theory models that fit the observed data better. I'd be quite curious about this!
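This is not the paper's actual fitting procedure, just a sketch of the kind of item response model under discussion: fit a Rasch-style sigmoid(skill - difficulty) model to a binary model-by-image correctness matrix (simulated placeholder data below) and then check whether the fitted difficulties look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated placeholder data: correct[i, j] = 1 if model i labels image j
# correctly. On real data this matrix would come from model predictions.
n_models, n_images = 40, 500
true_skill = rng.normal(1.0, 1.0, n_models)
true_diff = rng.normal(0.0, 1.5, n_images)
p_true = 1 / (1 + np.exp(-(true_skill[:, None] - true_diff[None, :])))
correct = rng.binomial(1, p_true)

# Fit a Rasch-style model P(correct) = sigmoid(skill_i - difficulty_j)
# by gradient ascent on the Bernoulli log-likelihood.
skill = np.zeros(n_models)
diff = np.zeros(n_images)
lr = 0.5
for _ in range(2000):
    pred = 1 / (1 + np.exp(-(skill[:, None] - diff[None, :])))
    resid = correct - pred              # derivative of the log-likelihood
    skill += lr * resid.mean(axis=1)
    diff -= lr * resid.mean(axis=0)
    diff -= diff.mean()                 # pin the location for identifiability

# On real data, compare the distribution of fitted difficulties to a Gaussian.
print(np.percentile(diff, [5, 25, 50, 75, 95]))
```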
Regarding the new CIFAR-10.1 and ImageNetV2 datasets being harder: my best explanation is that there is a trade-off between dataset hardness and label quality. If you have a lot of medium-quality annotations, a good way to get correct labels is to keep only the images with high agreement among your labelers. Those will indeed be correctly labeled images, but this process also removes correctly labeled but harder images that some annotators label incorrectly. For the reproductions we put more effort into ensuring label quality, so we ended up with more hard images than the original datasets have.
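As a toy illustration of that trade-off (a made-up annotator model, not the actual labeling pipeline): if annotator accuracy decreases with image difficulty, then keeping only unanimously labeled images systematically drops the hard tail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up annotator model: each image has a latent difficulty, and each of
# five annotators labels it correctly with probability sigmoid(2 - difficulty).
n_images, n_annotators = 10_000, 5
difficulty = rng.normal(0.0, 1.5, n_images)
p_correct = 1 / (1 + np.exp(-(2.0 - difficulty)))
correct_votes = rng.binomial(n_annotators, p_correct)

# Keeping only unanimously agreed images yields clean labels, but it
# preferentially throws away the harder (still correctly labelable) images.
kept = correct_votes == n_annotators
print("mean difficulty of kept images:   ", round(difficulty[kept].mean(), 2))
print("mean difficulty of dropped images:", round(difficulty[~kept].mean(), 2))
print("fraction kept:", round(kept.mean(), 2))
```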
In ecology, our model often has two parts: an observation model and a biological model. For example, in species distribution models, we want to know whether a particular location (described by a vector of "habitat covariates") is occupied by a particular species. This could be viewed as a simple classification problem: f(habitat) = P(occupied | habitat). However, our observations are made by humans who visit the location and spend some amount of effort looking to see if the site is occupied. The probability that they will detect the species, given that the site is occupied, depends on a set of observation covariates that may include some habitat covariates (density of shrubs) as well as covariates for effort and weather (and possibly degree of observer skill): g(obscovariates) = P(detection | site is occupied). The likelihood function is therefore something like P(detection | site is occupied) * P(occupied | habitat). This is known as the Occupancy Model, and we need to estimate the parameters of both f and g from the data. This estimation is quite delicate, because there are trivial solutions (e.g., all sites are occupied and all negative observations are due to low detection probability; or detection probability is 1.0 and all negative observations are due to bad habitat).
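For concreteness, here is a minimal sketch of that occupancy likelihood with made-up data and one covariate each for occupancy and detection (not any particular ecology package's API); sites with no detections contribute a mixture of "occupied but missed on every visit" and "truly unoccupied", which is where the delicacy comes from.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

# Hypothetical data layout: y[i, t] = 1 if the species was detected at
# site i on visit t; x_occ[i] and x_det[i, t] are single covariates.
rng = np.random.default_rng(0)
n_sites, n_visits = 200, 4
x_occ = rng.normal(size=n_sites)
x_det = rng.normal(size=(n_sites, n_visits))
z = rng.binomial(1, expit(0.5 + 1.0 * x_occ))                 # true occupancy
y = rng.binomial(1, expit(-0.5 + 1.0 * x_det)) * z[:, None]   # detections

def neg_log_lik(params):
    a0, a1, b0, b1 = params
    psi = expit(a0 + a1 * x_occ)        # f: P(occupied | habitat)
    p = expit(b0 + b1 * x_det)          # g: P(detection | occupied)
    detected = y.sum(axis=1) > 0
    # If detected at least once, the site must be occupied.
    lik_det = psi * np.prod(np.where(y == 1, p, 1 - p), axis=1)
    # If never detected: occupied-but-missed on every visit, or unoccupied.
    lik_none = psi * np.prod(1 - p, axis=1) + (1 - psi)
    lik = np.where(detected, lik_det, lik_none)
    return -np.sum(np.log(lik + 1e-12))

fit = minimize(neg_log_lik, x0=np.zeros(4), method="L-BFGS-B")
print(fit.x)   # estimated (a0, a1, b0, b1)
```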
Two questions: First, is it useful to view this as an extension of your "design-based ML" to include a measurement model? Second, I suspect that most ML analyses should include an explicit measurement model. We are accustomed to just dumping all of the covariates into a system and estimating h(obscovariates, habitatcovariates), but this loses the causal structure of the observation process.
With regard to reasoning, how well is that going? Yes, scaling has value, but in which domains does it work well?
In the 1990s, Keith Devlin tried to add time and location constraints to logic programming, which was in some ways similar to the police use of alibi and opportunity. (Devlin eventually abandoned his logic approach.) Judea Pearl pushed for causality, although it seems to me that it is adjacent to [Hidden] Markov Models, but with human involvement in deciding what can and cannot cause an event. (I don't know whether this has been solved with ANN architectures rather than symbolic coding.)
My playing with LLMs well behind the frontier models suggests, anecdotally, that logic works OK on fairly simple, abstract concepts, but fails when "causality" is needed. Gary Marcus goes further with the need for world models.
What is the state of play with regard to these issues in LLM reasoning models? Are they really reasoning as we assume human minds reason, or are they doing something very different? Is it fixable with ANN architectures, or do we need to integrate something else?