Nothing's Shocking
Medicine is the only human-facing science that consistently finds five-sigma interventions. Change my mind.
One of my favorite aphorisms is that medicine is the only human-facing science that consistently finds five-sigma interventions. Whether it is over-the-counter pills like ibuprofen for fever reduction, or life-saving treatments like antibiotics or insulin therapy, there’s a surprising number of interventions in medicine that we know work for most people under most conditions. I have rarely found anyone who objects to my pithy little quip. However, I recently got into a back-and-forth with my colleague Will Fithian about it, and now I’m interested in testing its validity.
To do this, I need to be precise about what I mean by a five-sigma intervention. Colloquially, I mean “probably didn’t need a trial as the effect was so undeniable.” However, I can make it more precise for a reasonable scientific conversation. By intervention, I mean a clearly defined action with a clearly measurable outcome. I also mean that there is a clearly defined notion of not acting, and the same outcome can be measured under both action and inaction. We can imagine any intervention as having some metaphysical treatment effect, equal to the outcome under action minus the outcome under inaction. For any intervention as I’ve defined it, I can imagine an average treatment effect over a hypothetical population.
By 5-sigma, I am making a deliberate allusion to the normal distribution. The tails of the Gaussian beyond 5 standard deviations from the mean have a probability mass of around 6 × 10⁻⁷. 5-sigma suggests “surprising” or “definitive,” and that something didn’t happen “by chance.” This gives me a working definition:
A 5-sigma intervention is one with an estimated average treatment effect that passes a well-stated statistical test with a p-value of less than 1 in a million.
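To make the sigma-to-probability correspondence concrete, here is a quick standard-library check of the Gaussian tail mass beyond 5 standard deviations (a sketch for intuition, not part of any of the analyses discussed here):

```python
import math

def two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(z / math.sqrt(2))

print(two_sided_tail(5))  # about 5.7e-07, comfortably under 1 in a million
```

So a clean 5-sigma result clears the 1-in-a-million bar with a little room to spare.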
As evidence that 5-sigma interventions were more common than I put on, Will sent me the OSF materials from the paper “Estimating the reproducibility of psychological science” by “The Open Science Collaboration,” a team of around 300 researchers.1 This paper, cited over 10,000 times, is a seminal document in the replication crisis panic in psychology in the 2010s. They replicated 100 experiments published in 2008 and found they could reproduce less than half of the effects. Helpfully, since the authors were all committed to unquestionable research practices, they uploaded a nice OSF repository with a csv file listing all considered studies and the associated replication attempts.
The csv file lists 167 papers published in psychology journals in 2008. Will ran a quick R script and found that 35 papers reported t- or z-statistics, and of those, 6 reported scores greater than 5. I’m not sure what “rare” should be, but 6 out of 35 papers from 2008 would definitely fall in the “pretty common” range for me. However, as I’ve been trying to hammer on this blog, you can’t just look at summary statistics and conclude anything. What exactly are these 6 interventions?
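I don’t have Will’s R script, but the gist of the filter is a few lines. Here is a Python sketch over a toy csv; the study names and the `t_stat` column are hypothetical placeholders, since the real OSF file uses different column names:

```python
import csv
import io

# Toy stand-in for the OSF csv. Both column names and values below are
# illustrative placeholders, not the real file's schema.
toy_csv = """study,t_stat
macaque_study,14.47
fmri_study_1,5.39
survey_study,10.36
word_order_study,10.18
small_effect_study,2.10
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))
big = [r["study"] for r in rows if float(r["t_stat"]) > 5]
print(big)  # the four toy rows with t > 5
```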
I downloaded the csv file and dug up some of the papers. I only found 5 (not 6) that reported t > 5. If you find the 6th one that I missed, let me know. The largest reported t-score was 14.47. Wow! But it was in a paper about macaques. This doesn’t count as human-facing. There were two fMRI papers with t-scores of 5.39 and 5.64, but the sample sizes were so small that the p-values were actually larger than 0.0001. More importantly, these both came out before the famous dead salmon paper suggested that there was an awful lot of fishing for activation regions in fMRI studies. In any case, they’re excluded too.
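For intuition on why a t-score above 5 can still miss the 1-in-a-million bar: with few degrees of freedom, the t-distribution has far fatter tails than the Gaussian. A sketch, numerically integrating the t density; the 10 degrees of freedom here are my illustrative guess, not a number taken from the fMRI papers:

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t: float, df: int, upper: float = 1000.0, n: int = 200000) -> float:
    """Two-sided p-value by trapezoidal integration of the tail from t to upper."""
    h = (upper - t) / n
    total = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    for i in range(1, n):
        total += t_pdf(t + i * h, df)
    return 2 * total * h

# t = 5.39 with a hypothetical 10 degrees of freedom:
p = two_sided_p(5.39, 10)
print(p)  # on the order of a few parts in ten thousand -- nowhere near 1e-6
```

The same t-score with hundreds of degrees of freedom would be essentially a 5-sigma z-score; with ten, it isn’t close.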
The next largest score was 10.36. This was a t-statistic associated with a survey. The investigators gave 125 participants descriptions of various scenarios and asked them to score, on a scale from -4 to 4, whether it would be better to be pessimistic or optimistic when taking action in each scenario. The mean response was 1.12, and the standard deviation was about 1.2.2 You can check for yourself that the t-statistic, equal to the mean divided by the standard deviation times the square root of the sample size minus one, comes out to 10.36. But a survey is a measurement, not an intervention, so this one doesn’t count either.
The next one on the list was a t=10.18. I couldn’t find the t-score in the paper, but Will helpfully informed me that the F(1,d) statistic is the square of the t(d) statistic. The csv file briefly notes that the replicator decided to translate the F-statistic into a t-statistic. Indeed, the original paper reports an F-statistic of 103.7, and the square root of that is 10.18. OK, so what’s the intervention?
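The conversion is just a square root: for an F-statistic with one numerator degree of freedom, F(1, d) = t(d)².

```python
import math

f_stat = 103.7              # F(1, d) reported in the original paper
t_stat = math.sqrt(f_stat)  # t(d) = sqrt(F(1, d))
print(round(t_stat, 2))     # 10.18
```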
The investigators conducted a study to confirm an effect known in psychology since the 1960s: that words that are easily confused when listening are hard to recall in ordering experiments. They showed participants a sequence of six words, one at a time, then presented all six on the same screen and asked them to reconstruct the order in which they had been shown. They found that people were better at remembering the order of dissimilar words like break, sick, vote, greet, rat, fun than of similar words like vote, boat, goat, float, note, coat.
If you wanted to be a statistical pedant, you could quibble with the fact that there isn’t a clear estimand for a t-test in this experiment because of interference concerns. But I’ll let that slide. This was one of a long line of replications of this confusion effect, first observed by Reuben Conrad in the 1960s. Even back then, Conrad’s p-value was less than one in a million billion (10⁻¹⁵). Tiny!
This probably should count as a 5-sigma intervention! It’s an interesting, robust observation, like optical illusions.
However, this finding wasn’t the main result of the 2008 paper. That study was about model building, not hypothesis testing. It demonstrated that a particular mathematical model could predict the magnitude of this well-known memory effect. The experiment with the large t-statistic was merely a means of collecting data for their model fitting. The authors are very clear about this.
I’m not sure why this study got thrown into the meta-analysis of the famous replication project. I’m also unsure why the Open Science Collaboration was unable to replicate this effect, which has been replicated repeatedly for sixty years. But I’ve given up asking that crew about their metascientific methods. For the purpose of this post, however, all that matters is that I have not yet found any 5-sigma interventions in their csv file.
Now let me be clear, I can create 5-sigma interventions in human activity by doing obvious things. I don’t think anyone is surprised that it’s hard to make mnemonics if all of the words sound like each other. Even less surprisingly, I could do an “experiment” where the outcome is reported happiness on a Likert scale and the intervention was punching a person in the stomach. That would be 5 sigma for sure. I could also conduct an experiment in which I forced participants to navigate bureaucratic nightmares in order to switch from a default, disadvantageous retirement savings plan to a higher-yield option. But unless you’re an economist, this is closer to stomach-punching than penicillin.
I don’t consider my weekend-warrior spreadsheet surfing dispositive, but I found nothing unaligned with my anarchobayesian priors. I maintain that five-sigma interventions are surprisingly common in medicine while exceptionally rare in all other human-facing sciences. Perhaps you have examples to refute this claim. Please yell at me in the comments. I’m all ears.
Yes, that is a paywalled link to a paper by a team that loves open science. Oh, the irony that comes with the prestige of Science The Magazine.
The authors don’t actually tell you the standard deviation. Instead, they write the inscrutable string of characters:
“Those asked to provide prescriptions recommended predictions that were optimistic (M = 1.12), t(124) = 10.36, p_rep > 0.99, d=0.93.”
In words, the mean survey response was 1.12. The t-statistic was 10.36. There were 125 participants. The probability of a replication, p_rep, is greater than 0.99. Cohen’s d was 0.93. I have no idea why they are reporting Cohen’s d. Cohen’s d needs two groups; there is no single-sample version of Cohen’s d. But since Cohen’s d is a mean divided by a standard deviation, you can back out that the standard deviation is about 1.2, which is consistent with a t-statistic of 10.36.
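The back-of-the-envelope arithmetic, treating the reported Cohen’s d as the single sample’s mean over its standard deviation:

```python
import math

m = 1.12   # reported mean response
d = 0.93   # reported Cohen's d, read here as mean / standard deviation
n = 125    # participants

sd = m / d                       # implied standard deviation
t = (m / sd) * math.sqrt(n - 1)  # mean over sd, times sqrt(sample size minus one)
print(round(sd, 1), round(t, 2)) # 1.2 10.36
```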


So far, nothing about the Open Science movement has convinced me that Philip Mirowski was wrong when he wrote this:
“Almost everyone is enthusiastic that ‘open science’ is the wave of the future. Yet when one looks seriously at the flaws in modern science that the movement proposes to remedy, the prospects for improvement in at least four areas are unimpressive. This suggests that the agenda is effectively to re-engineer science along the lines of platform capitalism, under the misleading banner of opening up science to the masses.”
https://journals.sagepub.com/doi/10.1177/0306312718772086
"divided by the standard deviation times the square root of one minus the sample size"
I think that should be "the square root of the sample size minus one".