How do you know so much about swallows?
A dialogue about the use and meaning of statistics with Ben Chugg
A few weeks ago, I had a super fun conversation with Ben Chugg and Vaden Masrani on their Increments podcast about the philosophy of statistical and probabilistic reasoning. As we were running out of time, Ben C. told me that he didn’t entirely buy my assessment of the use-value of statistical testing that I put forward in A Bureaucratic Theory of Statistics, and he wrote a thoughtful blog post following up on his objections. I was traveling when Ben posted and got delayed in replying. Since I’m so late, my reply ended up being longer than a comment, and I think it’s worth sharing here.
Ben C. and I agree more than we disagree on this topic. Let me clarify some of my points, which touch on several previous conversations on the blog, and also address some issues with his objections. We’ll see if the Bens can converge on a unified perspective.
I have three main points that I want to highlight:
1. Statistics is much more than statistical tests.
2. We know that things work without using statistical tests.
3. The stories we tell ourselves about what the outcomes of statistical tests mean are tenuous.
Ben C. starts with the following summation of my position, which needs an important clarification:
“Recht’s view is that statistics is less in the business of truth-seeking than in the business of providing clear, transparent rules for decision-making in large organizations.”
I’d agree with this sentence if he replaced “statistics” with “statistical null hypothesis tests.” The extra few words matter. I concede that I have an idiosyncratic instrumentalist view of statistics that drives some statisticians crazy. I care about how statistical methods are used, not what statisticians say statistical methods are for. As a non-exhaustive list, statistics can be used as part of bureaucratic games, randomized algorithms, or forecasts. How we use statistics in each of those three contexts is, in fact, different. Even the same statistical methods can have different uses. Confidence intervals are valuable in randomized algorithms because you can check whether or not the returned answer is valid. You don’t have this option of verification after running an opinion poll and computing the margin of error. Thus, even though we apply very similar code—computing averages or standard deviations—the meaning and value of statistics have to be evaluated in context.
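To make that contrast concrete, here’s a toy sketch (a made-up example, not anything from Ben C.’s post): a Monte Carlo estimate with a standard 95% confidence interval, where the ground truth is cheap to compute, so we can check afterward whether the interval actually covered it. A pollster never gets to run that last check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy randomized algorithm: estimate the average of f over {0, ..., N-1} by sampling.
# The exact answer is cheap to compute in this toy setting, so we can verify the
# interval after the fact.
def f(x):
    return np.sin(x) ** 2

N = 1_000_000
sample = rng.integers(0, N, size=20_000)
values = f(sample)

estimate = values.mean()
stderr = values.std(ddof=1) / np.sqrt(len(values))
lo, hi = estimate - 1.96 * stderr, estimate + 1.96 * stderr  # approximate 95% interval

truth = f(np.arange(N)).mean()  # ground truth, available here but never for a poll
print(f"estimate {estimate:.4f}, interval [{lo:.4f}, {hi:.4f}], truth {truth:.4f}")
print("interval covers the truth:", lo <= truth <= hi)
```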
Statistics does itself a disservice by hyperfocusing on its weakest link, inference. This is part of why I wrote the Bureaucratic Statistics paper in the first place. We have no means of computationally divining truth, and we can make a stronger argument for the future of statistics by focusing on its more grounded use cases.
Now, Ben C. argues that the reason we like statistical tests at all is their inferential power. He states the statistician’s party line:
“Statistical testing exists as a mechanism to keep us from fooling ourselves. It’s the first line of defense against naive realism—the tendency to accept uncritically what your eyes or gut tell you.”
He asks, “Do we actually not know if asbestos or cigarettes are bad for you?” and says that the only reason we know they are is statistical inference.
The difference between asbestos and cigarettes is illuminating. No statistical tests were needed to determine that asbestos caused acute lung damage. People who worked with asbestos developed disease quickly, died young, and had clear pathological indicators of lung damage. The rates of cancer among asbestosis patients were alarmingly high. Sure, people were looking at “rates,” and you could say maybe they would have been better off putting error bars on those rates. They seemed to get along just fine, constructing an etiology of asbestos-related disease without relying on p-values or confidence intervals. No one was fooled by randomness in the case-driven assessment of a major health hazard.
Smoking, on the other hand, was a more slowly acting poison. Lung cancer and COPD take decades to develop in smokers. Some smokers don’t ever develop either. The incidences of lung cancer and COPD in long-term, older smokers are roughly 15% and 25%, respectively. And plenty of people develop lung cancer and COPD without smoking. As Peter Attia documents in his neurotic Outlive, some centenarians attribute their longevity to whiskey and cigarettes. With such long lead times and relatively low incidences, it was harder to tease out the harm of smoking than the harm of asbestos.
We should give Richard Doll, Bradford Hill, and Richard Peto massive credit for their impressive long-term cohort study on smoking, which demonstrated a statistical cancer link. Thanks to them and decades of extra tabulation, we now know that smokers have more than a tenfold increased risk of lung cancer and a threefold increased risk of COPD.
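For those who want to see where a number like “tenfold” comes from, here’s a sketch with purely hypothetical cohort counts (illustrative only, not the actual Doll, Hill, and Peto data) of how a relative risk and its confidence interval fall out of a simple two-by-two tabulation.

```python
import numpy as np

# Hypothetical cohort counts, for illustration only (not the actual study data):
# 1,000 smokers and 1,000 never-smokers followed for the same period.
cases_smokers, n_smokers = 150, 1000
cases_nonsmokers, n_nonsmokers = 15, 1000

risk_smokers = cases_smokers / n_smokers
risk_nonsmokers = cases_nonsmokers / n_nonsmokers
rr = risk_smokers / risk_nonsmokers  # relative risk

# Standard large-sample 95% confidence interval, computed on the log scale.
se_log_rr = np.sqrt(1/cases_smokers - 1/n_smokers + 1/cases_nonsmokers - 1/n_nonsmokers)
lo, hi = np.exp(np.log(rr) + np.array([-1.96, 1.96]) * se_log_rr)

print(f"relative risk: {rr:.1f} (95% CI {lo:.1f} to {hi:.1f})")
```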
On the other hand, epidemiologists got way over their skis with cohort studies. Everyone wants to find the next cigarette: fats, carbs, ultra-processed foods. But the number of other diseases with causal attributions as unequivocal as smoking-causes-cancer is zero. Smoking-causes-cancer is not just the most successful observational causal inference ever done. It is the exception that proves the rule.
On the flip side, you also don’t need statistics to know many of our medicines are effective. There are no statistics in the penicillin work that won Florey a Nobel Prize. He and his colleagues literally saw infections recede when the treatment was applied. No statistics were needed to prove that chemotherapy cured leukemia in children. You don’t need statistics to know that heroin gets people high. I could list dozens of medical interventions where the effect size is overwhelmingly large and undeniable.
Again, as I discuss in Bureaucratic Statistics and in my forthcoming book, the reason we love RCTs in medicine is regulatory. RCTs provide a reasonable framework for the FDA to do its job: encourage pharmaceutical innovation while ensuring products are not poisonous and have the potential to help someone. The RCTs of the FDA, while expensive and onerous, are asking drug makers to clear a pretty low regulatory bar.
The regulatory function does not demand that drugs work for everyone! We’re fine with keeping the uncertainty that drugs might not work for some people because we (ideally) want doctors to have the flexibility of creative care. Ben C. argues that statistical tests are adopted because we believe in their inferential properties. However, how statistical testing is actually applied reveals that this isn’t true. We only use statistical tests when we’re uncertain and need some sort of consensus rule to move forward.
But why the RCT, then? Ben C. says I’d be unhappy if we picked a drug at random from the list of FDA applications and declared it legal without evidence. This is true. The FDA’s current testing protocols aren’t perfect, but I can make a strong case for them. Randomized drug trials are helpful because they mitigate biases, be they experimenter, participant, or lead-time. They test safety above all else, requiring that adverse events be logged and that trials halt when warning signals appear. And their goal is to argue that the drugs might be therapeutically helpful to someone, not that they are curatives for everyone.
You can say that the last step is inferential, but I don’t think that we can infer much at all when Kaplan-Meier curves are close to each other. All statistical inference arguments are stories about correlations that link the outcome of experiments to some model of the world. What we believe about this link is far more about language convention than it is about computational epistemology.
Ben C. inserts a fun example at the end that illustrates my point about storytelling. In a famous scene from Monty Python and the Holy Grail, Sir Bedevere helps resolve a village witch trial. The testimony from the village is problematic. A villager claims the defendant turned him into a newt, but he got better. Through the powers of logic and Socratic dialogue, Bedevere convinces the town that the best way to determine if the woman is a witch is to see if she weighs the same as a duck. The town is convinced, performs the test, and then gleefully whisks her away to be burned once the scales balance with the duck. As the mob carries her off, you can hear the woman sigh, “It’s a fair cop.”
I’m sorry, the allegory of Sir Bedevere is a perfect description of the subtle absurdity inherent in inferring from statistical testing. Statistical models are never true. We replace uncertainty with certainty by telling ourselves a convincing story that a single number is dispositive for some decision.1 Some stories are more reasonable than others. But even when we explicitly create the randomness and have binary outcomes and can exactly compute the p-value, we are still telling ourselves some inferential story about a metaphysical quantity, the average treatment effect. More often than not, our statistical stories are completely unreasonable, as is the case in almost all observational causal inference. And things are only getting worse. Have you guys heard about silicon sampling in social science? Yikes.
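To be concrete about the case where we “explicitly create the randomness”: here’s a toy sketch (made-up outcomes, not from any real trial) of a randomization test for a tiny two-arm experiment with binary outcomes. Under the sharp null that treatment changes no one’s outcome, every assignment is equally likely, so the p-value can be computed exactly by enumeration. Even then, the jump from that number to a statement about the average treatment effect is a story we choose to tell.

```python
import numpy as np
from itertools import combinations
from math import comb

# Hypothetical tiny two-arm experiment: 1 = outcome occurred, 0 = it did not.
treated = np.array([1, 1, 1, 0, 1, 1])
control = np.array([0, 1, 0, 0, 1, 0])

outcomes = np.concatenate([treated, control])
n, n_treated = len(outcomes), len(treated)
observed_diff = treated.mean() - control.mean()

# Sharp null: treatment changes no one's outcome. Every way of choosing which
# n_treated units get treated is then equally likely, so the exact p-value is
# the fraction of assignments whose difference is at least as large as observed.
extreme = 0
for idx in combinations(range(n), n_treated):
    mask = np.zeros(n, dtype=bool)
    mask[list(idx)] = True
    diff = outcomes[mask].mean() - outcomes[~mask].mean()
    if diff >= observed_diff:  # one-sided test
        extreme += 1

p_value = extreme / comb(n, n_treated)
print(f"observed difference {observed_diff:.2f}, exact p-value {p_value:.3f}")
```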
But we can come to consensus and decide together that some statistical stories are reasonable. This can help us break through tricky problems where we need to make decisions that impact multiple stakeholders with diverse interests and perspectives. We can come together and say if it looks like a duck, swims like a duck, and quacks like a duck...
Maybe I am too control-pilled (and I am), but my view is that statistical procedures produce open-loop decision rules, and any claim of inferential validity can only be made once the loop is closed by connecting the rules to the environment they were meant for.