I didn’t intend to write Monday’s post about reproducibility. The authors have replied in the comments of that post, and they make reasonable rebuttals. Read what Andrew, Erik, and Simon have to say.
But let me return to what I wanted to explore in the first place. My beef was less with the authors and more with a general big data mindset that confuses me (and, I suspect, not just me). Why do we believe that scraping data repositories into small statistical summaries tells us anything useful about how to interpret future outcomes? I’ll use the NEJM Evidence paper as an example in what follows, but I’m not critiquing the authors today! I’m after a more general question about what we learn from data mining.
Remember, in the NEJM Evidence paper and its predecessors, the authors downloaded a bunch of data from the Cochrane Database that listed outcome numbers in 23,500 RCTs. They used various formulas to convert these numerical outcomes into z-scores. They made a histogram of the z-scores. They fit a mixture of Gaussians to this histogram. This process reduced the Cochrane database to a probability distribution defined by eight parameters. They then made inferences by analyzing the fitted probability distribution (at this point, the data could be discarded entirely). They concluded that their results implied properties of studies that were “exchangeable with those in the Cochrane database.”
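For concreteness, here is a minimal sketch of that kind of pipeline. This is not the authors’ code: I make up placeholder z-scores and use scikit-learn’s GaussianMixture as a stand-in for their fitting procedure, and the number of components here is illustrative rather than theirs.

```python
# A minimal sketch of the pipeline (not the authors' code): fit a small
# Gaussian mixture to a vector of z-scores, then work only with the fit.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
z_scores = rng.standard_normal(23_500)       # placeholder for the Cochrane-derived z-scores

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(z_scores.reshape(-1, 1))

# From here on, all "inference" uses only the handful of fitted parameters.
print(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel())

# For example, draw synthetic z-scores from the fitted mixture instead of the data.
synthetic, _ = gmm.sample(10_000)
```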
What does it mean for a data point to be exchangeable with another data point? A probability distribution on a sequence of data is exchangeable if the distribution remains the same when the order of the data points is completely shuffled. The distribution has to be the same under any ordering of the data. This is a perfectly reasonable mathematical definition, but when does it hold? You could reasonably argue that coin tosses, or dice rolls, or roulette rounds are exchangeable. Certain sources of randomness shouldn’t depend on the order in which they occur. But as we move beyond intentional sources of randomness, it’s not clear how far the casino metaphor carries. What does it mean for a new observation of the world to be exchangeable with the data you scraped off the internet? The answer is never clear to me.
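In symbols (my paraphrase of the standard definition), a sequence of random variables is exchangeable if

$$ (X_1, X_2, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, X_{\pi(2)}, \dots, X_{\pi(n)}) \quad \text{for every permutation } \pi \text{ of } \{1, \dots, n\}. $$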
Let’s leave the NEJM Evidence paper behind and imagine a different experiment, which I adapt from an angry rant of Rudy Kalman. I am interested in understanding the decimal expansion of π. I collect the first ten million digits using a bespoke R script. I bin the data into a histogram, counting the number of times each digit occurs, and fit a categorical distribution to my digits. Now I conclude that I can study properties of the digits of π by sampling from this categorical distribution and plotting statistics of the samples.
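If you want to play along at home, here is roughly what I mean. This is a Python stand-in for the R script (using mpmath for the digits), and it uses far fewer than ten million digits so it runs quickly.

```python
# A rough sketch of the pi-digit "experiment": a Python stand-in for the
# R script, with far fewer than ten million digits so it runs quickly.
from collections import Counter
import numpy as np
from mpmath import mp, nstr

n_digits = 100_000
mp.dps = n_digits + 10                            # working decimal precision
pi_str = nstr(mp.pi, n_digits + 1)                # "3.1415926535..."
digits = [int(ch) for ch in pi_str[2:]]           # digits after the decimal point

# Bin into a histogram and "fit" a categorical distribution.
counts = Counter(digits)
probs = np.array([counts[d] for d in range(10)], dtype=float)
probs /= probs.sum()
print(dict(zip(range(10), probs.round(4))))       # every entry is close to 0.1

# "Study" pi by sampling from the fitted categorical distribution.
rng = np.random.default_rng(0)
synthetic_digits = rng.choice(10, size=1_000, p=probs)
```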
Using this method, I would learn that all of the digits occur at equal rates. OK, but what does this inference say about other digits? I have learned something about digits that are exchangeable with a random digit in the first 10 million digits of π. But which digits are exchangeable with these digits? Is the billionth digit of π exchangeable with my data set? Is the 15th digit of the decimal expansion of the square root of 2 exchangeable with my data set?
It’s pretty clear that I have destroyed all of the information about what makes digits “π-like” by compressing them into a small summary histogram and then overanalyzing that histogram. But I can’t see how my silly example about the digits of π differs in kind from many Big Data studies that invoke exchangeability.
There’s something bizarre about scale and statistics. All of modern statistics flows from the central limit theorem. Loosely, this says that if I collect enough data, the average of the quantity I aim to observe concentrates in a tight bell curve around its mean. Following this reasoning downstream, we convince ourselves that if a data set is large enough, then our inferences are correct.
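For the record, the version I have in mind is the classic i.i.d. statement: for samples with mean μ and finite variance σ²,

$$ \sqrt{n}\,\bigl(\bar X_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2), \qquad \bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i, $$

so the sample mean hugs μ in a bell curve whose width shrinks like σ/√n.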
But this is a weird paradox: the larger the data set, the less you know about the data in the set. Once our data sets get large enough, it necessarily means that we don’t look at the individual units. It means we flatten experience into tiny summaries. We necessarily understand less about the data as the size of the database grows. And we forget that we only need tons of data when the effects we are chasing are tiny with respect to the natural variability. Chasing lots of tiny effects by crudely compressing all human experience might not be conceptually sound. I’d echo my friend Michel Accad and argue that the practical value of a statistical inference is inversely proportional to the size of the data set. I also liked how Leif Weatherby put it on Twitter: “scale does not increase reliability, obscuring content questions by stabilizing them around norms with no secure line to goals.”
Now, you might ask, “What else can we do?” I know that big data science is more popular than ever, but I’ve been arguing on this blog for a while that we’re overdue for a retreat back to small data. There’s plenty of good to come from embracing the complexity of experience and understanding how to learn from the singular.
On the other hand, my machine learning friends will chime in that scale has been the most important driver of predictive progress. I don’t disagree. But this shows why we distinguish between prediction and inference in statistics, even though we often conflate these concepts when we start writing code. Pattern recognition and understanding are two different things.
As Leif Weatherby also adds, unfounded inferences have become “a condition of human experience now, because so many of our social and technological systems run on stats.” Since we can’t escape from them, we need to understand how to engage in a world steeped in stats. Understanding the limits of statistical prediction, and how to operate within the confines of a world stratified by automated but unjustified statistical decisions, are two of the hottest topics in data science.
You'll enjoy this pithy little chestnut from Charles Geyer:
"The story about n going to infinity is even less plausible in spatial statistics and statistical genetics where every component of the data may be correlated with every other component. Suppose we have data on school districts of Minnesota. How does Minnesota go to infinity? By invasion of surrounding states and provinces of Canada, not to mention Lake Superior, and eventually by rocket ships to outer space? How silly does the n goes to infinity story have to be before it provokes laughter instead of reverence?"
Read the whole thing; it's worth it: https://www.stat.umn.edu/geyer/lecam/simple.pdf
One idea for “what else can we do?” is here: https://doi.org/10.1016/j.tics.2022.05.008. In cognitive science, we can treat each participant as a replication of an N=1 study, then formally quantify how this generalises to the population by estimating the population-level experimental replication probability (the prevalence): https://elifesciences.org/articles/62461. I think this approach could also be useful in other areas.
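To make that concrete, here is a minimal sketch of the kind of prevalence point estimate I mean. This is my simplification, not the paper's reference implementation: treat each participant's N=1 test as a binary outcome with known false positive rate alpha and, for simplicity, assume perfect sensitivity; the linked eLife paper works out the full Bayesian treatment.

```python
# Minimal sketch of a population-prevalence point estimate (my simplification of
# the linked approach, not its reference implementation). Assumes each
# participant's N=1 test has false positive rate alpha and perfect sensitivity.
def prevalence_estimate(k_significant: int, n_participants: int, alpha: float = 0.05) -> float:
    theta = k_significant / n_participants       # observed rate of "significant" participants
    gamma = (theta - alpha) / (1.0 - alpha)      # correct for expected false positives
    return max(0.0, min(1.0, gamma))             # clip to a valid proportion

# e.g., 14 of 20 participants show the within-participant effect at alpha = 0.05
print(prevalence_estimate(14, 20))               # about 0.68
```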