35 Comments

Ben:

1. Just for the benefit of the readers, the authors of the paper, in order, are Erik van Zwet, Andrew Gelman, Sander Greenland, Guido Imbens, Simon Schwab, and Steven Goodman. I like our paper; I just think all the authors deserve credit here.

2. Erik responded in comments regarding your claim that the paper is "unreproducible" and related data issues.

3. At the very beginning of the post, you write that we "aimed to show that all randomized experiments are false." We never said that or implied that or anything like that. Indeed, I have no idea what it would even mean to say "all randomized experiments are false."

In summary, I get that you have strong feelings about reproducibility, based on a background that is different from ours. And I see that some of your commenters below appreciate your contrarian stance. That's all fine. The problem is that you seem to think you're arguing against us, but you're actually arguing against things we never did (perform an unreproducible analysis) or that we never said (something something about experiments being false).

It's kinda frustrating because I fear that people will read your post and not our paper (or this comment) and, as a result, come to the false impression that we did an irreproducible analysis, that "no one knows" what happened to 11,285 of the studies, and that we "aimed to show that all randomized experiments are false." To call all of that a distortion of our paper would be an understatement; it's pretty much the opposite of what we do and say in our paper!

Expand full comment

Thank you for engaging Ben in such a fact-based and respectful way. I wish he had responded to your comment.

Expand full comment

The consensus is there is no consensus.

Expand full comment

Ben: Your criticism missed the mark by a mile.

> What happened to the remaining 11,285? No one knows.

Not true. Have a look at the online supplement of van Zwet, Schwab and Senn:

library(dplyr) # needed for %>%, filter, group_by, sample_n

set.seed(123) # for reproducibility

d = read.csv("CochraneEffects.csv")

d = d %>% filter(RCT == "yes") # keep randomized trials only

d = d[d$outcome.group == "efficacy" & d$outcome.nr == 1 & abs(d$z) < 20, ] # first efficacy outcome, |z| < 20

d = group_by(d, study.id.sha1) %>% sample_n(size = 1) # select single outcome per study

> Let me reiterate: this paper about reproducibility is itself unreproducible.

Not true. We have made the data that we used available, and you may check them against the publicly available Cochrane Database of Systematic Reviews (CDSR). All the papers in the series have an online supplement with code, so everything is fully reproducible.

> These are not the reported z-statistics, but rather the derived z-statistics by a method of other authors.

Not true. The data were compiled by Simon Schwab who is an author of both papers. For RCTs with a numerical outcome, we have the means, standard deviations and sample sizes of both groups. For RCTs with a binary outcome, we have the 2x2 table. From these, we computed the estimated effects (difference of means or log odds ratio) together with their standard errors in the usual manner. From these, we computed the z-values. There is really no basis for accusing us of trying to deceive.
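
For concreteness, here is a rough R sketch of those calculations (illustrative only, not the exact code we used; the function names and the unpooled standard error for the mean difference are choices made just for this sketch):

# Continuous outcome: difference of means and its (unpooled) standard error.
z_continuous = function(m1, sd1, n1, m2, sd2, n2) {
  est = m1 - m2
  se  = sqrt(sd1^2 / n1 + sd2^2 / n2)
  est / se # z-value
}

# Binary outcome: log odds ratio from the 2x2 table of counts (a, b, c, d),
# with the usual (Woolf) standard error.
z_binary = function(a, b, c, d) {
  est = log((a * d) / (b * c))
  se  = sqrt(1/a + 1/b + 1/c + 1/d)
  est / se # z-value
}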

> Why is it plausible that a z-score in a clinical trial is a sample from a mixture of Gaussians? It’s not. It’s ridiculous.

Just think of it as a flexible density estimator which has some convenient mathematical properties.
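
For example, a zero-mean Gaussian mixture like the toy one below can follow both the bulk of the z-values and the heavy tails (the weights and scales here are invented for illustration; they are not the values fitted in the paper):

w = c(0.5, 0.3, 0.2) # mixing weights (sum to 1); illustrative only
s = c(1.5, 3, 8)     # component SDs; the wide last component accommodates large |z|
dmix = function(z) w[1]*dnorm(z, 0, s[1]) + w[2]*dnorm(z, 0, s[2]) + w[3]*dnorm(z, 0, s[3])
curve(dmix(x), from = -15, to = 15, xlab = "z", ylab = "mixture density")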

> no one has any idea what the content or value of all of this data is.

The CDSR is a very well known resource that is carefully curated. It's not perfect, but it's also not some random scrape.

Expand full comment

Apologies - I misremembered. The NEJM Evidence paper has an online supplement with the code, but the earlier paper with Schwab and Senn does not. However, I've just provided the code for the selection of those 23,551 trials.

Expand full comment
author

But this is my point, Erik. I know that many people are concerned with the replication crisis, but if we want to write papers critical of others' replicability, then we need to hold ourselves to even higher standards of replicability. It's not great that these exclusion criteria were not in the original paper. And, moreover, *why* are you using these criteria? For instance, why exclude trials where the z-score is larger than 20?

Expand full comment

You're barking up the wrong tree, Ben. Our paper is not critical of others' replicability. Quite the opposite. We're simply pointing out that in a field where the signal-to-noise ratio tends to be low, people expect too much from p<0.05. It's not so unexpected when a study with p=0.01 does not "replicate" (get p<0.05 again).

We're also not critical of the signal-to-noise ratio being low. In the paper with Schwab and Senn, we write: "The fact that achieved power is typically low does not imply that the usual sample size calculations aiming for 80% or 90% power are wrong. The goal of such calculations is to guarantee high power against a particular alternative that is considered to be of clinical interest. The fact that high power is often not achieved is merely an indication that treatments often do not provide the benefit that was hoped for."

The criteria for selecting the data are an attempt to get the primary efficacy outcome, and to ensure that each trial occurs only once in our dataset. The selection for |z|<20 is because such large z-values are extremely unlikely for trials that aim to test if the effect is zero. It also wouldn't have made a big difference if we kept them in, because our mixture model has one component with very large variance that can accommodate large z-values.
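
For scale (rough arithmetic only, not a figure from the paper): a trial designed for 90% power at two-sided alpha = 0.05 has an expected z of only about 3.2 under its design alternative, so |z| > 20 is far outside what such trials are set up to detect.

qnorm(0.975) + qnorm(0.90) # ~3.24, expected z under the design alternative at 90% power
2 * pnorm(-20)             # ~5.5e-89, two-sided p-value at |z| = 20 under the null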

Expand full comment

Love this! You’re so contrarian that you criticise the people who criticise science! (And I think you’re right.) You’re sort of a beacon against blind method application, so whenever that happens, you flag it.

Expand full comment
author

Hah, thank you. I am happy to embrace the label "Science Anarchist" that you called me on Twitter. :)

But when you look at the history of science, it's *never* been reproducible. It's always been a mess. And yet we learn from the mess...

Expand full comment

I’m forming my opinion on this atm - not sure exactly what to believe yet. I think there’s definitely a tendency from the metascience people to overregulate/overstandardise and make everything “too legible”, when great science (and great anything) happens at the edge of legibility and is sort of serendipitous.

Yet, I do think drowning in a sea of bad stuff is deleterious. You have tens of millions of dollars spent on research in Alzheimer’s that includes fake figures. You have people using fake social science to support some kooky policies and so on.

My solution to this atm is one nobody will like (make unis smaller)

Expand full comment
author

Those larger structural questions are good ones. It's possible that we actually make more progress with more people because we can try more stuff and hence increase the possibility of serendipity. But then everyone gets stressed out and sad, and it also becomes impossible to keep up.

Here's a case for more researchers: https://www.argmin.net/p/the-national-academy-of-spaghetti

Here's a case for less: https://www.argmin.net/p/too-much-information

I'm conflicted!

Expand full comment

You’re giving me fodder for thought. Might reply to you long form from my substack

Expand full comment
Jan 8 · edited Jan 8 · Liked by Ben Recht

Ben is 100% correct in saying, essentially, that things go bad at the moment when methodology becomes ritualized. This is what got us into the whole p-value mess in the first place, and it's not really surprising that now the debunking has become a ritual as well.

Expand full comment

ok question for both of you (btw I am enjoying substack much more than twitter, so much less stupidity here - thanks ben for encouraging me to migrate and write long form).

so, I know that ritualised methodology is kinda dumb for towering intellects like us. BUT, do you think it might have a role? Like, sure it won't be everything and the greatest researchers won't simply guide themselves by p-values, but it allows for a baseline level of "something" being right or semi right. An analogy here would be the role of hypocrisy in society. I used to underestimate it and ofc a hypocritically nice person is not a "very good" person, but coming from a very chaotic and non-hypocritical country (Romania), I think a level of hypocrisy allows for a sort of baseline level of goodness

Expand full comment
author

Two part answer:

1) I don't think RCTs have had that much impact on scientific discovery. BUT I think they play a critical role in our medical regulatory framework. And regulations necessarily require some form of ritual so everyone can agree upon the rules.

2) With regards to science, we create rituals in our labs and in our disciplines, sometimes explicitly but more often than not *implicitly*. This is unavoidable. Rituals themselves aren't bad, but we have to constantly question whether they are necessary or holding us back. There has to be a give and take between "rules" and "play" in the game of science.

Expand full comment

Yeah! Exactly

Expand full comment

And the way you stop ritualisation is by making science smaller 😂

Expand full comment

"A small country has fewer people. Though there are machines that can work ten to a hundred times faster than man, they are not needed."

Expand full comment
Jan 9 · edited Jan 9 · Liked by Ben Recht

> When you write scripts to digest tens of thousands of trial reports, and you don’t look at any of them, no one has any idea what the content or value of all of this data is [...]

> But you shouldn’t believe anything. Especially nothing that comes from an unreproducible process of web crawling. Let me reiterate: this paper about reproducibility is itself unreproducible.

Ben: These are some serious allegations you are making here publicly. I did the data collection very rigorously, and the Cochrane data from each systematic review is extremely well-structured XML data, including metadata. I created an R package in 2021 that can import such data files. I just gave it a minor update and included more documentation. You can now download and collect the data yourself. https://github.com/schw4b/cochrane/.
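
For example, assuming you have the remotes package, you can install it directly from that repository (I'm not showing the import functions here; see the package documentation for those):

remotes::install_github("schw4b/cochrane") # install from the GitHub repository above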

Expand full comment
author

Thanks for the code update, Simon. My main issue is that papers that are about replication must hold themselves to the same standards as they want to set. In particular, your NEJM Evidence paper was based on a model derived in your Stat in Med paper, which in turn was based on code from an online preprint. But it was never clear how to move between these:

1. The OSF repo does not have instructions for how to make the CSV file: https://osf.io/xjv9g/

2. The Stat in Med paper leaves out the exclusion rules that Erik notes here: https://open.substack.com/pub/argmin/p/is-the-reproducibility-crisis-reproducible?r=p7ed6&utm_campaign=comment-list-share-cta&utm_medium=web&comments=true&commentId=46899660

You could argue that these are minor issues and can be fixed by private correspondence, but it's 2024 and that shouldn't be necessary. It especially should not be necessary for a paper that is questioning the replicability of all medical trials. My concern is that no one can achieve the Platonic ideal of reproducible science that metascientific critiques argue for.

Expand full comment
Jan 8 · Liked by Ben Recht

> Wait, hold on. These z-scores aren’t derived from the reported p-values in RCTs? They are p-values computed by other metascientists for another project. That’s weird! In the NEJM Evidence article Van Zwet et al. claim “We have collected the absolute z statistics of the primary efficacy outcome for all of these RCTs.” But this is misleading at best. These are not the reported z-statistics, but rather the derived z-statistics by a method of other authors. We’re looking at something different than what the authors have led us to believe. In that case, we shouldn’t believe any of the conclusions in the NEJM Evidence paper

Technically, I agree with you here. This type of evidence is far from ideal. It is also extremely rare for influential statisticians to publish in NEJM, Nature, Science, Nature Medicine, etc. In order to be a methodological critic in journals like NEJM, you absolutely need empirical data. From their perspective, any empirical data (even flawed) is better than a good theoretical argument. Chances are, if you randomly select clinical or science papers from amazing applied statisticians (like the authors on this list), it won't be their best work. This type of empiricist bias is one of the reasons why.

Such a piecemeal approach to analyzing data is common across much of pre-clinical biomedicine. It makes the statistician in me cringe. There is no guarantee that sufficient statistics have been preserved for the final inferential question across a mish-mash of preprocessing done by different people. Practically though, it is impossible to get anything done without working with the data that exists. Almost nobody will let you re-analyze clinical datasets or give you access to individual participant level data. Certainly not on this scale. I don't blame them for making the best of datasets that already exist.

Ideal metascience work where you can interrogate clinical results all the way from individual participant level data would be an amazing godsend. In practice, you still need approval from every original PI to get access to it. See, e.g., https://vivli.org/resources/requestdata/. A nearly impossible task.

Expand full comment
author

I don't disagree with anything you write here. But I have a couple of rejoinders:

1) I agree that most papers by good authors aren't great, but the lead author and Andrew were both all over social media with this particular paper, and I saw it cause a lot of confusion among clinicians.

2) I worry about doing science to appeal to journal standards. Journals are indeed stodgy and obnoxious with their unearned righteousness. Who are these people to decide what is good or bad or valid? It's ridiculous. That said, the solution is burning down these journals, not tying our hands to appease them.

Expand full comment

It is hard to build a career in academia with strategy 2). I've refused to do the kind of work deemed acceptable by such journals if it doesn't meet my standards, but it just means they publish work from others that is worse. There is something to be said for incremental changes. If I could do it over, I would say better to publish early and iterate work continually.

Expand full comment
Jan 8 · Liked by Ben Recht

"I don’t want to go into the nitty-gritty of what a p-value is here because it’s annoying and distracting." ~ this is naive, but whenever smart people say this, I never really get what's so complicated. do you have other writing on the nitty gritty of p-values, for those who welcome the distraction?

Expand full comment
author

Fair. On the one hand you're right, it's not complicated: the p-value is the probability that the observed data (or something more extreme) would be observed assuming a null hypothesis. And yet, hashing out what we should believe based on these values seems to only lead to confusion, pain, and suffering...
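
As a toy example in R (made-up number, just to show the arithmetic):

z = 2.1            # an observed z-statistic
2 * pnorm(-abs(z)) # about 0.036: the chance of data at least this extreme if the null were true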

Expand full comment
Jan 8 · Liked by Ben Recht

Some related questions:

How do you handle in a principled manner the statistics from the failed trials? From undersized trials done to investigate if there is a large effect or even signs of an effect? Drug development is meta-Bayesian and anything that's not a registrational trial is likely being conducted to inform the overall program, not just hit a p-value target.

These all tie to your main point here...

Expand full comment
author

Let me respond to your questions obliquely with an unpopular answer: Randomized trials should be used as part of regulatory frameworks, not as part of scientific discovery. RCTs were popularized because the FDA and NIH wanted better evidence that treatments were safe and effective. A/B tests are most effective as a means of preventing bad code from getting pushed into production. RCTs create a bar (albeit a low one) for drugs/software to clear.

On the other hand, there is scant evidence that RCTs are particularly useful for "science" per se.

Expand full comment
Jan 8 · Liked by Ben Recht

Come to think of it, you're going to have later-stage trials in there where there's an effect size that is nonzero but the trial will appear underpowered because it is designed to find only effects above a certain magnitude for commercial reasons. Oftentimes you'll see drugs where they're looking for an effect large enough to beat the competition or other paths of treatment, not just something that's very much non-zero. Those, too, would mess with your metascience statistics.

Expand full comment
Jan 8 · Liked by Ben Recht

Agreed, which is why Ph1/2 trials are very often not RCTs (or at least not fully R or C). But they'll still appear in those databases....

Expand full comment

You have your causal ordering wrong. RCTs became popular because they are the most effective means for making causal inferences when studying a certain type of scientific problem (roughly, "what happens when we do X?"). It is for this reason that groups like the FDA and NIH require them. Their effectiveness caused both their rise in popularity and their regulatory requirement. Of course, I'm sure that being required increased their use, but that's not why they came into vogue.

In the social and biomedical sciences, RCTs are and will remain one of our best tools for making causal inferences. They have contributed to and advanced science greatly within the domain where they are effective.

They are not a panacea. Problems with scaling from RCTs to other contexts are the rule, and not the exception. They don't let you study phenomena that are not subject to intervention, for practical or ethical reasons. They also come with their own uncertainty.

Nevertheless, it's unfathomable to me to actually believe they aren't useful for "science," unless you define science extremely narrowly.

Expand full comment
author

Can you point me to the history that shows that RCTs became popular because of causal inference? I'd be interested in understanding the evidence better as your assertion goes against my study of the history.

Expand full comment

The wikipedia page gives a fair enough rundown in the history section: https://en.wikipedia.org/wiki/Randomized_controlled_trial

To be clear, I don't mean the field of statistical work we now call "causal inference" popularized RCTs. But rather, the ability to exploit randomness to provide confidence in causal explanations -- a practice older than the phrase RCT or its formalization -- is why we use them. They are scientifically effective in contexts where people want to know with confidence what will happen when an intervention is done. Naturally, they then became popular in the biomedical sciences, agriculture, and program evaluation.

Clinical researchers don't just do RCTs because their funding bodies require them. They do them because they are a powerful way to generate knowledge.

Expand full comment
author

I'm sorry, as Neri Oxman and Bill Ackman have taught us all, Wikipedia doesn't count, even if you cite it. And the wiki summary of the history of the RCT is beyond cursory. It mentions streptomycin and then jumps to the ISIS trial. A lot happened between those two events!

It's neither responsible nor accurate to ignore the roles of the FDA, the NIH, and the NCI in the institutionalization of the RCT in clinical practice.

Expand full comment
Jan 12 · edited Jan 12

I have no doubt about, nor would I at all argue against, the claim that numerous professional bodies have encouraged more RCT use. My doctoral training was funded by a program specifically put in place to institutionalize RCT use in my (then) subfield.

My point is not to argue the finer points of history but to point out that RCTs' widespread use—and their institutionalization—is because they’re effective scientific knowledge generators. Do you contend they were institutionalized for other reasons?

Expand full comment

When the author doesn't respond to the people he's casting allegations against.

Expand full comment