1. Just for the benefit of the readers, the authors of the paper, in order, are Erik van Zwet, Andrew Gelman, Sander Greenland, Guido Imbens, Simon Schwab, and Steven Goodman. I like our paper; I just think all the authors deserve credit here.

2. Erik responded in comments regarding your claim that the paper is "unreproducible" and related data issues.

3. At the very beginning of the post, you write that we "aimed to show that all randomized experiments are false." We never said that, implied that, or anything like that. Indeed, I have no idea what it would even mean to say "all randomized experiments are false."

In summary, I get that you have strong feelings about reproducibility, based on a background that is different from ours. And I see that some of your commenters below appreciate your contrarian stance. That's all fine. The problem is that you seem to think you're arguing against us, but you're actually arguing against things we never did (perform an unreproducible analysis) or that we never said (something something about experiments being false).

It's kinda frustrating because I fear that people will read your post and not our paper (or this comment) and, as a result, come away with the false impression that we did an irreproducible analysis, that "no one knows" what happened to 11,285 of the studies, and that we "aimed to show that all randomized experiments are false." To call all of that a distortion of our paper would be an understatement; it's pretty much the opposite of what we do and say in our paper!


Love this! You’re so contrarian that you criticise the people who criticise science! (And I think you’re right.) You’re sort of a beacon against blind method application: whenever it happens, you flag it.


Ben: Your criticism missed the mark by a mile.

> What happened to the remaining 11,285? No one knows.

Not true. Have a look at the online supplement of van Zwet, Schwab and Senn:

library(dplyr) # needed for filter(), group_by() and sample_n()

set.seed(123) # for reproducibility

d <- d %>% filter(RCT == "yes") # keep randomized trials only

d <- d[d$outcome.group == "efficacy" & d$outcome.nr == 1 & abs(d$z) < 20, ]

d <- group_by(d, study.id.sha1) %>% sample_n(size = 1) # select a single outcome per study

> Let me reiterate: this paper about reproducibility is itself unreproducible.

Not true. We have made the data that we used available, and you may check them against the publicly available Cochrane Database of Systematic Reviews (CDSR). All the papers in the series have an online supplement with code, so everything is fully reproducible.

> These are not the reported z-statistics, but rather the derived z-statistics by a method of other authors.

Not true. The data were compiled by Simon Schwab, who is an author of both papers. For RCTs with a numerical outcome, we have the means, standard deviations, and sample sizes of both groups. For RCTs with a binary outcome, we have the 2x2 table. From these, we computed the estimated effects (difference of means or log odds ratio) together with their standard errors in the usual manner, and from those we computed the z-values. There is really no basis for accusing us of trying to deceive.
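For readers who want the mechanics, here is a quick sketch of the binary-outcome case (in Python rather than the R of the supplement, and with made-up counts): the log odds ratio and its Woolf standard error from a 2x2 table, giving the z-value.

```python
import math

# Illustrative sketch only, with made-up counts -- not the actual pipeline.
# Binary outcome: events / non-events in each arm of a hypothetical RCT.
a, b = 30, 70  # treatment arm: events, non-events (assumed numbers)
c, d = 50, 50  # control arm: events, non-events (assumed numbers)

log_or = math.log((a * d) / (b * c))   # estimated log odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # Woolf (delta-method) standard error
z = log_or / se                        # the z-value used in the analysis

print(round(z, 2))  # -2.86
```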

> Why is it plausible that a z-score in a clinical trial is a sample from a mixture of Gaussians? It’s not. It’s ridiculous.

Just think of it as a flexible density estimator which has some convenient mathematical properties.
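Concretely, a scale mixture of a few zero-mean Gaussians already gives a heavy-tailed density that integrates to one. The weights and scales below are made up for illustration; they are not the fitted values from the paper.

```python
import numpy as np

# Made-up mixture for illustration: three zero-mean components, with the
# larger standard deviations supplying heavier tails than a single Gaussian.
weights = np.array([0.5, 0.3, 0.2])
means = np.zeros(3)
sds = np.array([1.0, 2.0, 4.0])

def mixture_pdf(z):
    """Density of the Gaussian mixture, evaluated pointwise (z may be an array)."""
    z = np.asarray(z, dtype=float)[..., None]
    comp = weights * np.exp(-0.5 * ((z - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return comp.sum(axis=-1)

# Numerical check that the mixture is a proper density (integrates to ~1).
grid = np.linspace(-40.0, 40.0, 80001)
total = mixture_pdf(grid).sum() * (grid[1] - grid[0])
print(round(total, 4))  # ≈ 1.0
```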

> no one has any idea what the content or value of all of this data is.

The CDSR is a very well known resource that is carefully curated. It's not perfect, but it's also not some random scrape.

Jan 9 · edited Jan 9 · Liked by Ben Recht

> When you write scripts to digest tens of thousands of trial reports, and you don’t look at any of them, no one has any idea what the content or value of all of this data is [...]

> But you shouldn’t believe anything. Especially nothing that comes from an unreproducible process of web crawling. Let me reiterate: this paper about reproducibility is itself unreproducible.

Ben: These are some serious allegations you are making here publicly. I did the data collection very rigorously, and the Cochrane data from each systematic review are extremely well-structured XML, including metadata. I created an R package in 2021 that can import such data files. I just gave it a minor update and included more documentation. You can now download and collect the data yourself: https://github.com/schw4b/cochrane/.

Jan 8 · Liked by Ben Recht

> Wait, hold on. These z-scores aren’t derived from the reported p-values in RCTs? They are p-values computed by other metascientists for another project. That’s weird! In the NEJM Evidence article Van Zwet et al. claim “We have collected the absolute z statistics of the primary efficacy outcome for all of these RCTs.” But this is misleading at best. These are not the reported z-statistics, but rather the derived z-statistics by a method of other authors. We’re looking at something different than what the authors have led us to believe. In that case, we shouldn’t believe any of the conclusions in the NEJM Evidence paper

Technically, I agree with you here. This type of evidence is far from ideal. It is also extremely rare for influential statisticians to publish in NEJM, Nature, Science, Nature Medicine, etc. To be a methodological critic in journals like NEJM, you absolutely need empirical data. From their perspective, any empirical data (even flawed data) is better than a good theoretical argument. Chances are, if you randomly sample the clinical or science papers of amazing applied statisticians (like the authors on this list), it won’t be their best work. This kind of empiricist bias is one of the reasons why.

Such a piecemeal approach to analyzing data is common across much of pre-clinical biomedicine. It makes the statistician in me cringe. There is no guarantee that sufficient statistics have been preserved for the final inferential question across a mish-mash of preprocessing done by different people. Practically, though, it is impossible to get anything done without working with the data that exist. Almost nobody will let you re-analyze clinical datasets or give you access to individual participant-level data, certainly not at this scale. I don't blame them for making the best of datasets that already exist.

Ideal metascience work, where you could interrogate clinical results all the way down to individual participant-level data, would be an amazing godsend. In practice, you still need approval from every original PI to get access; see, e.g., https://vivli.org/resources/requestdata/. It's a nearly impossible task.

Jan 8 · Liked by Ben Recht

"I don’t want to go into the nitty-gritty of what a p-value is here because it’s annoying and distracting." ~ This is naive, but whenever smart people say this, I never really get what's so complicated. Do you have other writing on the nitty-gritty of p-values, for those who welcome the distraction?
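For what it's worth, the mechanics (as opposed to the interpretation, which is where the arguing starts) fit in a few lines; this is just the textbook two-sided p-value for a standard-normal z statistic, not anything from Ben's post.

```python
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value: P(|Z| >= |z|) for Z standard normal under the null."""
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

print(round(two_sided_p(1.96), 3))  # 0.05 -- the familiar threshold case
```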

Jan 8 · Liked by Ben Recht

Some related questions:

How do you handle, in a principled manner, the statistics from failed trials? From undersized trials done to investigate whether there is a large effect, or even signs of an effect? Drug development is meta-Bayesian, and anything that's not a registrational trial is likely being conducted to inform the overall program, not just to hit a p-value target.

These all tie to your main point here...
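One way to see why "failed" undersized trials still carry information: their power is low even when the effect is real. A rough sketch (illustrative numbers, not from any particular trial), with the true effect measured in standard-error units:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(snr, z_crit=1.96):
    """P(|Z| > z_crit) when Z ~ N(snr, 1): power of a two-sided z-test at alpha = 0.05."""
    return phi(-z_crit - snr) + 1 - phi(z_crit - snr)

print(round(power(1.0), 2))  # 0.17 -- an undersized trial usually "fails" even for a real effect
print(round(power(3.0), 2))  # a comfortably powered trial, by contrast
```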


When the author doesn't respond to the people he's casting allegations against.
