Hermeneutics of crapshoots
How many times do I need to see something to believe it?
One of the odder quirks of ritualized statistical inference is its veneration of combinatorics as the arbiter of belief. The “null hypothesis” is that the world is nothing but a bunch of coin flips. Validity is a long enough run of heads.
The bizarre obsession with hot streaks starts, of course, with Ronald Fisher. His tea lady experiment, the core motivating case study in his book “The Design of Experiments,” is maddening. Algal researcher Muriel Bristol told Fisher she could tell by taste whether milk or tea had been added to the cup first. He challenged her to a test in which he poured 8 cups of tea: 4 milk first, 4 tea first. He mixed up the order and presented them to Bristol. She correctly identified all 8. Perfect accuracy. But Fisher explains in excruciating detail that the p-value under the null hypothesis (that she picked a set of 4 cups uniformly at random) is only 1/70. That’s about 0.014. I guess that gets two asterisks in our table. But who is convinced by two asterisks?
So sure, if Fisher had forced the poor scientist to taste 16 cups instead of 8, we’d have more confidence in her palate. I have no doubt she’d have labeled all 16 cups correctly, and Fisher would have gone to his abacus and come back with a p-value less than 0.0001. And then what? There’s still a chance she’s a random number generator. Let’s do 32 cups! Statistics asks us to fixate on combinatorics and forget that Muriel labeled all of the cups correctly. Ad infinitum, we never prove anything. This is where we sit today, where any contrarian is granted the rhetorical maneuver of “oh, your study was underpowered.”
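If you want to check the arithmetic, here is a minimal sketch in plain Python (nothing assumed beyond the cup counts) of how the p-value shrinks as the cups pile up:

```python
from math import comb

# Under Fisher's null hypothesis, the taster picks which half of the cups
# are "milk first" uniformly at random. Only one of the comb(cups, cups/2)
# possible selections is exactly right, so a perfect score has probability
# 1 over that count.
for cups in (8, 16, 32):
    ways = comb(cups, cups // 2)   # ways to pick which half are "milk first"
    print(f"{cups} cups: p = 1/{ways} ≈ {1 / ways:.2g}")

# 8 cups:  p = 1/70        ≈ 0.014
# 16 cups: p = 1/12870     ≈ 7.8e-05
# 32 cups: p = 1/601080390 ≈ 1.7e-09
```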
What’s so frustrating is how the combinatorial mindset refuses to grapple with actual cause and effect. The attitude that combinatorics alone determines valid inference infantilizes epistemic inquiry. How does Muriel taste the difference in tea preparations? What cues does she use? What does this say about her personal tastes? Why does she put up with assholes like Ronald Fisher? The proper answer here is to realize that adding milk to tea is gross and to go visit a nice Persian cafe.1
This combinatorial thinking muddies the literature of the human-facing sciences. Every person becomes a statistic. The best evidence becomes a scatter plot. I was reminded of this when scouring the LEAP study for numbers. When the investigators wrote about subsamples of the population, they tended to present percentages rather than counts. For example, writing “3.2% of the treatment group developed allergy” instead of “10 children in the treatment group developed allergy.” Only one child is counted in their per-protocol analysis as definitely having consumed peanuts and developed some sort of allergic reaction at age 5. The only thing we know about them is that they were 0.3% of the treatment group. Maybe there’s something unique about this individual. I want to know what made them different. I can’t find the particulars of that anywhere in the paper or appendices. Perhaps the secret lies in the trial data, which I would need to fill out several forms to access.
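As a small illustration of how lossy this reporting is, here is a sketch of the reverse-engineering a reader is reduced to. The only input is the quoted 0.3%; the assumption that it was rounded to one decimal place is mine:

```python
# All the paper gives us about that one child is "0.3% of the treatment
# group." Reverse-engineering it: a single child reports as 0.3% (to one
# decimal place, an assumption on my part) whenever 0.25 <= 100/n < 0.35,
# which only brackets the group size.
consistent_sizes = [n for n in range(1, 1000) if 0.25 <= 100 / n < 0.35]
print(min(consistent_sizes), "to", max(consistent_sizes))   # 286 to 400
```

Under that assumption, all we can recover is that the per-protocol treatment group had somewhere between 286 and 400 children. The child stays a rounding error.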
It shouldn’t be mutually exclusive to report trial statistics and detail individual cases, but that’s the convention we’ve converged on. House rules force us to reason only about relative proportions. Paper length restrictions leave us with statistical shadows. Privacy laws prevent us from seeing the details of what actually happened. We are left ignorant of every other aspect of causality. Once you’ve seen the standardized rhetorical device enough times, it becomes unconvincing. And if you bring this up, you are cast out as a heretic.
As I reiterated on Monday, medicine is the only human-facing science that routinely finds 5-sigma interventions. What run of success do you need to be convinced a treatment works? The house rules sadly limit the statistician’s role to mindlessly arbitrating sample sizes using an implausible model of the universe. What sample size would convince you that Muriel Bristol is a tea connoisseur? What sample size do you need to convince yourself that penicillin cures bacterial infections? That chemotherapy cures childhood leukemia? That ibuprofen relieves headaches? That kidney transplants are effective? That peanut consumption prevents peanut allergy? Randomized trialomania is a baffling mindset where we are only convinced a treatment is effective if we randomly split a sufficiently large group in two.
What “sufficiently large” means is the only question we seem willing to ask. 8 cups of tea, 640 children for the LEAP trial. You could argue that the 640 number is far too large. That is, the LEAP study was overpowered. The large trial size harmed the control group. A study with 200 subjects would have found convincing results with p<0.01. The trial size also clouded our understanding. Statistical rules forced us to count 7 children who were excluded immediately after randomization. The administrative burden of managing the trial meant that a dozen children were lost to follow-up. Might we have learned more, and might the trial subjects have been better off, with a careful study of 10 children rather than an aggregate look at 640?
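To put rough numbers behind the overpowered claim, here is a back-of-the-envelope sketch. The 3-in-100 rate echoes the 3.2% quoted earlier; the 17-in-100 rate for the avoidance arm and the even 100-per-arm split are placeholders I am assuming for illustration, not figures from the paper:

```python
from math import comb

# A hypothetical 200-child trial, 100 per arm. The 3 allergic children in
# the consumption arm echo the 3.2% quoted earlier; the 17 in the avoidance
# arm and the even split are placeholders, not figures from the paper.
# Under the sharp null that eating peanuts changes nothing, the 20 allergic
# children land in the two arms hypergeometrically (Fisher's tea test,
# scaled up), and the one-sided p-value is the chance of seeing 3 or fewer
# in the consumption arm.
n_per_arm, allergic_eat, allergic_avoid = 100, 3, 17
total, allergic_total = 2 * n_per_arm, allergic_eat + allergic_avoid

p = sum(comb(allergic_total, k) * comb(total - allergic_total, n_per_arm - k)
        for k in range(allergic_eat + 1)) / comb(total, n_per_arm)
print(f"one-sided exact p = {p:.4f}")   # comfortably below 0.01
```

Even with those made-up-but-plausible rates, a 200-child trial clears the conventional bar with room to spare.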
My friend Tara is horrified by the idea of adding milk to tea.


"Why does she put up with assholes like Ronald Fisher?" I laughed out loud; thank you for brightening my day.
It seems to be an almost ineradicable article of faith that you can compensate for lack of causal understanding by introducing huge amounts of intentional randomness and doing a bunch of combinatorics.