The replication crisis in psychology is over. Or at least that’s what I read in this news article in Science yesterday. Paul Bogdan, a neuroscientist at Duke, scanned 240,000 psychology papers and found that the share reporting weak p-values had fallen from 32% to 26% between 2004 and 2024. Let’s break out the champagne, everyone. I don’t want to dump on Paul, who’s caught up in a viral whirlwind here. I’m just surprised by the excitement in the science reform community that made this paper go viral in the first place.
Joe Bak-Coleman has been on a tear on Bluesky about the overreaction. I recommend his timeline for exasperated puzzlement about what on earth the results of this paper mean. But it should be obvious that the problem in psychology was not that 6% of the papers had p-values in a bad range. Shouldn’t we expect 0% of the results to be weak if we were all good scientists? We’re going to settle for 26%?
Peter Ellis wrote a great blog post about what “weak p-values” could even mean for those of us who like to get into the statistical mud. He points out, correctly, that modeling the proper distribution of p-values requires massive, unjustifiable, unverifiable assumptions. He shows that under standard models where effects, interventions, and outcomes are chosen at random in trials (random trials with controls, if you will), the distribution of p-values is extremely sensitive to how you model the population of potential experiments.
Studying the replication crisis with more statistical modeling will never bring us any resolution.1 To do studies of what we should “expect” to see, one has to assume some probabilistic model for generating science. These models are woefully inadequate for capturing anything about the social dynamics of how science works. They necessarily ignore the fact that people writing scientific papers know they have to run a null hypothesis significance test to get their paper published. This is Goodhart’s Law screaming at you loudly: p-values are a regulatory mechanism, not a measurement device.
It doesn’t usually get much clearer than this. We only talk about p-values at all because they are a bar that must be cleared to get some study approved, be it a drug trial, an A/B tested piece of code, or a scientific paper. Studying the distribution of p-values around 0.05 is studying Milton Friedman’s thermostat.
When you blog for long enough, you find yourself saying the same thing more than once. I can see why Matt Levine is so successful. He says the same five things every week, but the people doing money stuff keep doing the same outrageous things, so it never gets old. In the spirit of everything being securities fraud, Bogdan’s paper and its reaction are hitting multiple buttons at once for me: observational studies are never-dispositive stories about correlations; significance testing is a regulatory device; p-values are correlation coefficients in a trench coat.
I’ve said that last one before a dozen times on here, but I think I need to keep saying it.
In 99.999% of the cases where the term is used, a p-value is a goofy way of summarizing the correlation between an intervention and its outcome. Here’s how you do it. First, compute the Pearson correlation coefficient between the treatment and the outcome. This is a number between -1 and 1. Then multiply this number by the square root of the sample size. If this scaled correlation coefficient, call it the score, has magnitude larger than 2, we say the p-value is less than 0.05. If the score is larger than 2.6, we say the p-value is less than 0.01. If the score is larger than 3, we say the p-value is less than 0.0025. And if the score is larger than 4.4, we say the p-value is less than 0.00001.
I could go on. You compute the scaled correlation coefficient score:
import numpy as np
score = np.corrcoef(treatment, outcome)[0, 1] * np.sqrt(len(treatment))  # Pearson r times the square root of the sample size
And look this score up in a table to find your p-value.
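If you don’t have the table handy, the lookup is just the tail probability of a standard normal. Here’s a minimal sketch, assuming numpy and scipy are available and a two-sided test; the function name is mine, not anything standard:

import numpy as np
from scipy import stats

def score_to_p_value(treatment, outcome):
    # Pearson correlation times sqrt(n), then the two-sided normal tail probability.
    r = np.corrcoef(treatment, outcome)[0, 1]
    score = r * np.sqrt(len(treatment))
    return 2 * stats.norm.sf(abs(score))

Plugging in scores directly, 2 * stats.norm.sf(2) is about 0.046 and 2 * stats.norm.sf(2.6) is about 0.009, which is where the thresholds above come from.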
I’m belaboring this because it’s so silly to collapse all of the information in your experiment into this one number and assume it tells you anything. It’s also silly because blindly scaling the coefficient by the square root of the sample size means you can lower your p-values just by moving away from in-person experiments to online Mechanical Turk experiments.
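To put toy numbers on that (the correlation here is invented for illustration): the same tiny treatment-outcome correlation that goes nowhere in a 200-person lab study clears every threshold once you recruit a million people online.

import numpy as np

r = 0.02                        # the same tiny treatment-outcome correlation in both settings (made up)
print(r * np.sqrt(200))         # in-person study: score ≈ 0.28, nowhere near 2
print(r * np.sqrt(1_000_000))   # Mechanical Turk-scale study: score = 20, "significant" at any threshold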
It’s also obviously silly to set this score as a target for all experiments. The treatment-outcome correlation can mean completely different things in different contexts. Correlations need stories! Trying to pretend you can do science without the stories isn’t going to advance anything.
Because we can tell ourselves all sorts of stories about what small and large effects might look like here. How correlations should scale depends on the problem you are studying. If you are conducting a vaccine study, you expect almost no one to actually contract the disease. The correlation between treatment and outcome will therefore be nearly zero, because nearly everyone ends up “not sick” regardless of which group they’re in.
For example, of the 43 thousand participants in the Pfizer COVID vaccine trial, only ten developed “severe COVID.” Nine were in the control group, and one was in the treatment group. That seems like a big effect! Yet the correlation between treatment and the severe-COVID outcome was only 0.01 in magnitude, because severe COVID is exceptionally rare. Scaling by the square root of the sample size gives a score of 2.5. Looking that up in the table, the p-value is 0.012.
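You can reproduce that arithmetic in a few lines. This is a rough sketch, assuming the roughly 43,000 participants were split evenly between the two arms (the real enrollment was slightly larger and not exactly balanced), so the outputs match the numbers above only up to rounding:

import numpy as np
from scipy import stats

n_per_arm = 21_500                          # assume an even split of ~43,000 participants
treatment = np.repeat([1, 0], n_per_arm)    # 1 = vaccine, 0 = placebo
outcome = np.zeros(2 * n_per_arm)
outcome[0] = 1                              # the 1 severe COVID case in the vaccine arm
outcome[n_per_arm:n_per_arm + 9] = 1        # the 9 severe COVID cases in the placebo arm

r = np.corrcoef(treatment, outcome)[0, 1]
score = abs(r) * np.sqrt(2 * n_per_arm)
p_value = 2 * stats.norm.sf(score)
print(r, score, p_value)                    # ≈ -0.012, 2.5, 0.011

The correlation comes out negative because vaccination reduces severe disease; it’s the magnitude that gets quoted.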
Now suppose you are studying some small effect in an A/B test. You think that some new widget will increase everyone’s time on a website by 30 seconds a week. The correlation between your widget and time on site is close to zero because your defined outcome is inherently noisy, and your population of website addicts is very heterogeneous. But you’ll get your code shipped if the correlation between time on the website and your widget is large enough. So you run the test on a million people and cross your fingers.
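Here’s a cartoon of that bet, with every number invented: say weekly time on site swings by hours from user to user (a standard deviation of 10,000 seconds below), while the widget genuinely adds 30 seconds. The implied treatment-outcome correlation is about 30 / (2 × 10,000) = 0.0015, so even with two million users the expected score only just crosses 2, and any single run can land on either side of it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000_000                                    # total users in the A/B test (made up)
treatment = rng.integers(0, 2, size=n)           # random 50/50 assignment to the widget
baseline = rng.normal(3_600, 10_000, size=n)     # very noisy weekly seconds on site (a crude stand-in for a heterogeneous population)
outcome = baseline + 30 * treatment              # the widget really does add 30 seconds

r = np.corrcoef(treatment, outcome)[0, 1]
score = r * np.sqrt(n)
print(r, score, 2 * stats.norm.sf(abs(score)))   # expected score ≈ 2.1; any particular run may or may not clear 2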
I mean, I get it. We need some rules for pushing code and shipping drugs. We can move p-value thresholds around, and people will adjust their practices to meet these thresholds. But thinking that you can study the statistics of these scores and infer something about the validity of knowledge is a fool’s errand. You can make a strong case that drug discovery needs formal rules to promote patient safety. But logic devolves into absurdity pretty quickly if you try to make the same case for scientific expertise.
Last time I said this, Andrew Gelman issued a fatwa against me.
I’m here for your same five points week in and week out. Pump it straight into my eyes!
P-values as bureaucratic rule / regulatory device is a statement worthy of elevation to revelation.
Wasn't p-value hacking a known component of this whole dust-up? Soooo.....
I feel like we're in need of a suitably grim and sly 'iron law' as a larger intellectual framework for these kinds of discussions: maybe 'doing sophisticated statistics is only necessary to find small effects, and small effects have a habit of not existing'.