Is the reproducibility crisis reproducible?
Rekindling the war on p-values in 2024.
Right before I left for the winter break, Andrew Gelman and his collaborators lit up social media with a paper claiming that most clinical trials were underpowered. In a paper in NEJM Evidence, they set their sights on the summary statistic everyone loves to hate, the p-value. Gelman referred to this paper as “Bayesians moving from defense to offense.” The coauthors included econ Nobel winner Guido Imbens and prolific epidemiologist Sander Greenland. The paper contained claims like
“The probability of a replication study yielding P<0.05 in the same direction is small, even in the stratum from 0.005 to 0.001. Thus, a replication with P>0.05 does not imply that the original finding was a fluke at least not in the context of historical clinical trials just as P<0.05 in the original study does not imply that the initial conclusion was correct, especially when P is near 0.05.”
Certainly, proponents of the replication crisis love to say things like this. We all love to terrify our students with claims that science is wrong and unreplicable because our colleagues are not rigorous enough. But does this new paper provide good evidence of such stern warnings?
Let’s dig in! What did they do exactly? The paper irked me from the get-go by triggering one of my biggest methodological pet peeves. They aimed to show that all randomized experiments are false by running a deeply confounded observational study. Boo! But the details turned out to be worse than that.
Let’s start at the highest level of how the authors try to make their case. I need one technical bit of annoying statistics to be precise in my description: the paper drew conclusions about a model of the z-score of outcomes in clinical trials. What is a z-score? It is a quantity that is morally equivalent to a p-value. I don’t want to go into the nitty-gritty of what a p-value is here because it’s annoying and distracting. The following will suffice. Every outcome in a randomized trial gets labeled by a p-value. The smaller the p-value, the more we’re convinced that the measured difference between the groups is due to the treatment and not the random assignment of people into two groups. Instead of a p-value, you could report the z-score. There’s a formula that lets you go from one to the other. The larger the z-score, the smaller the p-value, and vice versa. This paper worked with z-scores rather than p-values because, for technical reasons, z-scores are easier to work with mathematically. The authors aimed to show that their model of z-scores of clinical trials displayed some worrying properties that would suggest a replication crisis.
In this paper, the proposed method was:
Collect a huge data set of the z-scores of randomized controlled trials from the Cochrane database
Build a statistical model of these z-scores
From this statistical model, draw a bunch of samples to simulate expected rates of reproducibility or misestimation of primary effect sizes.
Everything here rests on step 1. Is their data set of z-scores legitimately representative of all clinical trials in medicine? It turned out that this NEJM Evidence paper didn’t actually do step 1 at all. In fact, this paper started at step 3! They took a model directly from a 2021 Statistics in Medicine paper by van Zwet, Schwab, and Senn. This earlier model fit a mixture of four zero-mean Gaussians to the symmetrized set of z-scores from some trials they had scraped. Why is it plausible that a z-score in a clinical trial is a sample from a mixture of Gaussians? It’s not. It’s ridiculous. But I’m not even going to dwell on the model because I still want to focus on where the data came from.
van Zwet, Schwab, and Senn’s 2021 paper did not have code, but it linked to their data set. They claimed they fit their model to 23,551 randomized trials in the data set. I downloaded the data and counted 34,836 unique trials. What happened to the remaining 11,285? No one knows. I decided to dig deeper to see if I could find out.
van Zwet, Schwab, and Senn assert that Schwab, Kreiliger, and Held initially collected and processed the data in 2019. I looked at this 2019 paper and found a shocking passage:
“Comparing reported effect sizes from primary studies across medical specialties required harmonization. We recalculated all effect sizes for the primary studies using escalc() from the R package metafor . For continuous outcomes the SMD (Hedges’ g ), for dichotomous outcomes the odds ratios was used. Odds ratios were transformed into Hedges’ g which were then transformed into Pearson’s r. Statistical significance was calculated with a Wald’s test which was performed on the original effect measure as reported in the CDSR (e.g., risk ratio, hazard ratio, e tc.); for mean differences we applied a two-sample Student’s t-test.”
Wait, hold on. These z-scores aren’t derived from the reported p-values in RCTs? They are p-values computed by other metascientists for another project. That’s weird! In the NEJM Evidence article Van Zwet et al. claim “We have collected the absolute z statistics of the primary efficacy outcome for all of these RCTs.” But this is misleading at best. These are not the reported z-statistics, but rather the derived z-statistics by a method of other authors. We’re looking at something different than what the authors have led us to believe. In that case, we shouldn’t believe any of the conclusions in the NEJM Evidence paper.
This is why you shouldn’t form “priors” based on “big data scraping.” When you write scripts to digest tens of thousands of trial reports, and you don’t look at any of them, no one has any idea what the content or value of all of this data is. Someone just shows you a univariate histogram and wants you to believe that it’s true. But you shouldn’t believe anything. Especially nothing that comes from an unreproducible process of web crawling. Let me reiterate: this paper about reproducibility is itself unreproducible.
Erik van Zwet asserts “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?” But how we use the information from past trials matters. Scaping all trials with an R-script and blindly stitching together a table of p-values without ever looking at any of the trial reports is radical. This is an absurd path to knowledge building. Am I surprised that the authors of the NEJM Evidence article concluded that they can explain the reproduction crisis? No. As is the case with all observational studies, the authors found exactly what they were looking for. But are the conclusions supported by the “data?” No. The evidence is weak.
I respect what the authors are trying to do, and I’m a fan of a lot of their work! I write this blog just as a warning though: it’s really hard to play by the rules of rigor that the metascience community thinks will fix science. And I actually think that the proposed fixes of more statistics and methodological handcuffs won’t make a dent in the problem.
Instead, the fix is to stop this sort of big data mining and get back to basics. People who are experts at reading clinical trials know that the p-value is the tiniest piece of the puzzle. In fact, the p-value in a clinical trial is never needed. Clinical trials use the most primitive tests to compare outcomes, and p-values are just one way of summarizing the three numbers associated with an outcome. Any outcome has a number of bad events in the treatment arm, a number of bad events in the control arm, and a number of patients in the study. From these three numbers, you can compute a p-value. But the p-value here just serves as a sanity check to confirm the measured effect of the intervention is large enough. After that, you have to look at the trial implementation for other biases and potential alternative explanations for the differences between groups. A lot goes into building this intuition for reading trial reports, and it takes time to learn what to look for. But if you want to fix experiments in clinical medicine, this is where you start. A strict Bayesian statistical model of the p-value won’t be part of your process at all.