The blog has been hyperfocused on optimization this quarter, but a confluence of campus happenings—a wonderful seminar by Deborah Mayo, conversations at the Simons Institute program on Generalization, and class lectures on statistics—has dragged me back onto my statistical hobbyhorse. Today’s post synthesizes my prepared remarks for the panel accompanying Mayo’s seminar and my bridging thoughts between the two statistical topics in my convex optimization class, decision theory and experiment design.
Mayo’s lecture focused on her conception of Popperian severe testing and the role of statistics in such tests. To quote her directly:
“We have evidence for a claim C only to the extent C has been subjected to and passes a test that would probably have found C flawed, just if it is.” - Deborah Mayo
This testing notion should be familiar to folks who followed along with my Meehl blogging earlier this year. In Meehlian language, the verisimilitude of claims can be tested by experiment, and we want to demonstrate a “damned strange coincidence” to convince ourselves that our claim is causally associated with our experiment. For falsificationists, the more surprising the experimental outcome, the more we are assured our claim resembles the truth.
Statistical testing is supposed to measure our surprise level. But it is not at all clear that this is what it does. I’m a big fan of Dan Davies’ re-popularization of cyberneticist Stafford Beer’s adage, “The Purpose of a System Is What It Does.” If, after 100 years of fighting and browbeating, we see that statistical testing consistently fails to be severe testing, then it’s pretty silly to keep teaching our students that statistical tests are severe tests. But then, what exactly do statistical tests do?
To answer these questions, we need to introduce a new character to the statistics wars. Statisticians and philosophers spend far too much time close-reading Fisher and Neyman and far too little reading Bradford Hill. Hill had learned a lot from Fisher and Neyman, but his methods were far more pragmatic and far less dogmatic. And his influence on the practice of statistics dwarfs that of Fisher and Neyman.
Though Hill is best known as an epidemiologist and the first to discover a causal link between smoking and cancer, there are no randomized clinical trials without Hill. Not only did he design the first randomized clinical trial, but he consulted with global health leaders, including American institutions like the FDA and National Cancer Institute (NCI), about reasonable guidelines for how trials should be conducted and should inform policy. His consultation was pivotal in the design and acceptance of randomization as a central part of the mammoth Salk Vaccine trial.
Hill’s view was that randomized trials were for informing policy. How should doctors find best practices? How should the FDA decide which drugs are harmful and helpful? Hill’s writing is full of philosophical thoughts about such questions, but he almost never talks about p-values. The p-values were there to convince statisticians. For doctors, he emphasized design and randomization as tools for bias mitigation.
To see why medicine needs some sense of statistics, it’s helpful to go back to the very beginning. Florey’s 1941 report on the clinical application of penicillin had 10 case studies. It was unambiguously clear that penicillin was a wonder drug for treating infection. Five years later, Hill would lead the first RCT to test the effectiveness of a different antibiotic, streptomycin, for treating tuberculosis. Hill’s trial had over 100 people in it. Of the 55 treated, only 4 perished. However, tuberculosis wasn’t uniformly lethal: of the 52 in the trial assigned to bed rest alone, only 15 died. Though its effectiveness as a treatment was nowhere near as cut and dried as penicillin’s, streptomycin passed the initial test of being better than nothing. Passing that test let us explore more effective treatments for TB, leading to our modern complex cocktails.
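To put numbers on that initial test, here’s a minimal sketch using the mortality counts quoted above; the choice of Fisher’s exact test is for illustration only and isn’t a reconstruction of the trial’s original analysis.

```python
# A quick check of the streptomycin trial's headline comparison, assuming the
# counts quoted above: 4 of 55 treated patients died, versus 15 of 52 patients
# assigned to bed rest alone.
from scipy.stats import fisher_exact

table = [
    [4, 55 - 4],    # streptomycin: died, survived
    [15, 52 - 15],  # bed rest alone: died, survived
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"mortality: {4/55:.1%} (treated) vs {15/52:.1%} (bed rest)")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

With these counts, the gap in deaths is far too large to chalk up to the luck of the randomization, which is all the initial “better than nothing” test asserts.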
The 1946 streptomycin trial is a parable of the randomized trial: we only need such trials for treatments that don’t always work. Moreover, they help ensure safety as much as they do efficacy. That’s the main reason the FDA and NCI latched onto them in the first place.
Fisher would lament that Neyman and Pearson “confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money.” It’s clear from his writing and practice that Hill decidedly sided with Neyman and Pearson.
I don’t know how you can look back on the past hundred years and think otherwise.
Statistics has only been useful for speeding production and perhaps for saving money. The idea that we could quantify verisimilitude by some means of inference (fiducial, Bayesian, or whatever) has always been a philosophical puzzle blocked by the Humean objection to induction.
If you disagree, tell me your favorite examples of statistical methods being employed to definitively prove or disprove the truth of a theory. What are the grand discoveries that we wouldn’t have made without an understanding of null hypothesis testing? At some point, the Purpose of a System Is What It Does. The purpose of statistical testing in the sciences is getting papers published.
But in engineering, medicine, and beyond, statistics serves a useful, if flawed, regulatory role. It’s exceedingly helpful for quality control, safety calculations, and drug regulation. Perhaps the human-facing “sciences,” which are growing ever more wary of the power of RCTs for truth-finding, can think about RCTs in this regulatory context.
For statisticians, this means there’s a need for introspection about what the field is for. I’ve argued that we should take a Cartwrightian dappled view of probability (also here), and we should take a dappled view of statistics. Statistics asks many different kinds of questions, but we confuse our students because the methods often look the same. Are we trying to quantify the verisimilitude of a theory or assertion? Or are we trying to quantify the error in a measurement to aid decision making? We need to speak with clarity about this. I don’t want to blame anyone, but until we disambiguate these uses, we will have more dumb arguments about what p-value thresholds mean for the replicability of science.
So I’ll close with a word to the folks who like to philosophize about statistics, whether they be philosophers, statisticians, or bloggers. We need less focus on Popper’s modus tollens and more on his piecemeal social engineering. We should think less about severe tests and more about the significance of low regulatory bars designed to prevent harm. For this, we’ll have to talk more about dynamics, action, and recourse. We need to better theorize and communicate how statistical thinking can help us figure out what works.
Isn't the right question which false discoveries were prevented through the application of statistics? That is, the purpose of statistics is to prevent (or at least reduce to, say, 5%) the publication of false papers. Looking at published papers to assess this is not very informative. Even so, in my experience, because standard statistical practice fails to capture many sources of variation, it fails at a much higher rate than 5%.
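A minimal simulation sketch of that point, under one assumed uncaptured source of variation (a shared per-batch offset that the analysis ignores): even with the null hypothesis exactly true, the naive test rejects far more often than its nominal 5%.

```python
# Simulated "studies" in which the null is true but observations arrive in
# batches that share a common offset. A t-test that treats every observation
# as independent rejects far more than 5% of the time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_batches, per_batch = 2000, 5, 20

false_positives = 0
for _ in range(n_sims):
    # Each group: 5 batches of 20 observations with identical true means.
    # The per-batch offsets are the source of variation the test ignores.
    a = rng.normal(0, 1, (n_batches, per_batch)) + rng.normal(0, 1, (n_batches, 1))
    b = rng.normal(0, 1, (n_batches, per_batch)) + rng.normal(0, 1, (n_batches, 1))
    _, p = ttest_ind(a.ravel(), b.ravel())
    false_positives += p < 0.05

print(f"rejection rate at nominal 5%: {false_positives / n_sims:.1%}")
```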
More generally, as you have emphasized multiple times, passing an NHST only tells us there might be a non-random signal present. It doesn't tell us what the signal implies about our beliefs.
Ben:
I've written a response on my blog errorstatistics.com:
https://errorstatistics.com/2024/10/22/response-to-ben-rechts-post-what-is-statistics-purpose-on-my-neyman-seminar/