No one can explain why or when statistics generalize and transfer. Statistics tries to sell us stories like this:

You go to the Berkeley Bowl and look at all of the apples. You note that 90 percent of the apples are red. From this tally, you expect that about 90 percent of the apples at the Monterey Market will be red.

Statistics argues that rates of some events should be the same in different contexts. In which contexts? Why do we believe in this extrapolation of rates? After years of struggling with this, I learned to stop worrying. Statistics is *predicated on believing in transfer*. Transfer must be assumed, implicitly or explicitly, as a model of reality. We can reason statistically only if we agree that rates extrapolate. We can reason statistically only if we agree that the future will be similar to the past.
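To make the apple story concrete, here's a minimal sketch in Python. The stores and their red-apple rates are invented for illustration; the point is that extrapolating the tally works exactly when the second store's underlying rate matches the first's, and fails otherwise.

```python
import random

random.seed(0)

def sample_red_rate(true_rate, n=1000):
    """Tally n apples and return the observed fraction of red ones."""
    return sum(random.random() < true_rate for _ in range(n)) / n

berkeley_bowl = sample_red_rate(true_rate=0.90)    # the store we surveyed
monterey_market = sample_red_rate(true_rate=0.90)  # transfer holds: same underlying rate
corner_store = sample_red_rate(true_rate=0.50)     # transfer fails: different underlying rate

print(f"Berkeley Bowl:   {berkeley_bowl:.2f}")
print(f"Monterey Market: {monterey_market:.2f}")  # close to the Berkeley Bowl tally
print(f"Corner store:    {corner_store:.2f}")     # nowhere near it
```

Nothing in the tally itself tells you which of the other two stores you're walking into; that's the belief you have to declare up front.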

I’ve been trying to figure out when people started accepting that past rates would tell us something about future rates. Statisticians have been grappling with this from the beginning, and none have good answers. I’ve been particularly stuck on a quote by mathematical statistician M. S. Bartlett from 1951:

“In so far as things are similar and definite ... they can be counted and new statistical facts are born ... Our arithmetic is useless unless we are counting the right things.”

What are those right things? The term of art in experimental science for when statistics collected in one setting apply to another is *external validity*. (Long-time readers know I’ve written about it several times on this blog, for example here, here, and here’s a guest post by Deb Raji too.)

External validity contrasts with *internal validity*: whether statistical tests on a particular sample population are valid. Internal validity usually comes down to calculations for hypothesis tests or confidence intervals, justified by intentional experimenter randomness. This is statistical bread and butter. But having read too many papers on external validity, distribution shift, domain shift, and related buzzy topics, I am unconvinced that statistics can say anything useful about external validity. External validity isn’t a math problem. External validity is a problem of prediction, and, though I get into trouble every time I say this, there is no evidence that statistics can say anything useful about prediction unless we *model* the future as being exactly like the past.
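That bread and butter fits in a few lines; the numbers below are hypothetical. A confidence interval like this is internally valid for the population you sampled, but it says nothing about whether the rate transfers anywhere else.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# 900 red apples out of 1000 at the store we actually surveyed.
lo, hi = proportion_ci(successes=900, n=1000)
print(f"95% CI for the red-apple rate: ({lo:.3f}, {hi:.3f})")
```

The interval quantifies sampling error at *this* store. No calculation of this kind tells you it applies at the store down the street.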

Don’t get me wrong, this isn’t a terrible model! The sun will come up tomorrow. If the future and the past are similar, it’s good to have statistical forecasting tools. But when the future and the past are different, all bets are off. So we have to carefully examine our assumptions around why we believe results transfer.

In my search through the literature on external validity, I read a compelling survey by political scientists Findley, Kikuta, and Denly that lays out a checklist for external validity. Notably, nothing in their framework is mathematical. They call their checklist M-STOUT, for Mechanisms, Settings, Treatments, Outcomes, Units, and Time. Let me go through these:

**Settings**: Is the physical environment of the original statistical study the same as the one it’s being extrapolated to? If the original study is on a private college campus, does it apply to a first-grade class in a rural elementary school?

**Treatments**: Does the treatment mean the same thing in the new context? For instance, did we update the formula of a drug?

**Outcomes**: Are we measuring the same outcome? For instance, are we directly surveying individuals or relying on remote estimates?

**Units**: Are the individual units we’re examining the same as those in the original study? Does a medical study of sick people extrapolate to a healthy cohort?

**Time**: Has so much time elapsed that we’d expect a significantly different set of conditions than when we estimated the original statistic?

**Mechanisms**: I saved this one for last because it’s the most important. Do we believe the statistic is plausible, and do we have strong conceptual models for why we think the statistic is robust and repeatable?

If these look mostly the same, you should hopefully get similar outcomes. But how similar do the M-STOUT factors have to be in order to see similar outcomes? It’s impossible to know. I’ve been surprised by this time and time again. My group has done replication studies in machine learning where we tried our hardest to collect new data really close to the old data, and yet we saw big differences in model performance.
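Here's a toy simulation of that surprise, with all distributions invented for illustration: a fixed decision rule that works well on the original data loses several points of accuracy when the new data's means shift by a seemingly small amount.

```python
import random

random.seed(1)

def make_data(n, mean_shift=0.0):
    """Label-1 points cluster near 1 + shift, label-0 points near -1 + shift."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.gauss((1.0 if y else -1.0) + mean_shift, 1.0)
        data.append((x, y))
    return data

def accuracy(data, threshold=0.0):
    """A fixed rule: classify x > threshold as label 1."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

old_test = make_data(5000)                   # same distribution the rule was tuned on
new_test = make_data(5000, mean_shift=-0.8)  # a seemingly small collection shift

print(f"accuracy on old data: {accuracy(old_test):.2f}")
print(f"accuracy on new data: {accuracy(new_test):.2f}")
```

The rule didn't change; the world did, and nothing in the original test accuracy warned us how much performance would move.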

Findley, Kikuta, and Denly accept that predicting transportability from statistics is a fool’s errand. Instead, they argue that external validity should be examined through conceptual modeling of mechanisms and plausibility arguments for the scope of the repeatability of results. They are siding with Bartlett. Statistics is powerful when counting the right things, but you’ll need some other philosophy to know the right things to count.

## When Past Performance Is Indicative Of Future Returns

In *The Logic of Chance* (1866), John Venn observed that statistical regularity increases as you collect more data over time, but that the regularity breaks down after enough time has passed. His examples were lifespans and heights.

Forgive me if this sounds ignorant, but in CS281A we had about two lectures (maybe more?) on generalization error in machine learning. From what I recall (and how Wikipedia defines it), generalization error is supposed to be "a measure of how accurately an algorithm is able to predict outcome values for previously unseen data." And of course, I remember you did not speak very highly of the conventions that arose in the machine learning community from some of those theories. In this blog, however, you argue that external validity is more of a philosophical question, grounded in beliefs attached to a strong conceptual model, and hence more qualitative than quantitative. So my question is: given that this blog seems to hand-wave generalization error as never having been all that useful, what did we learn from it that's worth taking away when we use machine learning algorithms in real-world applications?