This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.
Leif Weatherby reminded me that there is a distinction between Reproduction and Replication that’s worth repeating in the context of Meehl’s lectures. I realize now that I conflated these in my last post, and that’s a grave error. Replicability and reproducibility are fundamentally different demands on experimentation and are important for fundamentally different reasons. I’m going to take this post to delineate the meaning of two words with the same first three letters.
Let me yield to leading metascience experts for the definitions of two rep-concepts. Nosek et al.’s Annual Review of Psychology article “Replicability, Robustness, and Reproducibility in Psychological Science” provides solid definitions. Reproducibility means that “if someone applies the same analysis to the same data, the same result should occur.” Replication means that if someone does “the same study again,” “the same outcome recurs.”
By these definitions, reproducibility is such a small ask! If someone releases a repository of code and data, all of the analyses in the main paper are easily reproduced. I’ve become a major stickler for reproducibility. There are no excuses for scientific results to be unreproducible. I’ve seen instances where people release the plotting command to generate plots from (x,y) pairs but never tell you how they computed the xs and ys. That is simply not enough. There should be a complete, well-commented pipeline that takes the rawest possible data and produces a rendering of the paper. Every scientist produces their papers with computers: after data collection, all of the analysis, visualization, and writing are computerized. Hence, it is by nature completely reproducible. Reproduction just demands that scientists log the steps they take from data to journal submission.
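To make this concrete, here is a minimal sketch of the kind of pipeline I mean. Everything in it is hypothetical: the file names, columns, and functions (raw_measurements.csv, load_raw_data, and so on) are stand-ins rather than pieces of any particular paper. The shape is the point: one script that runs from the rawest data to the figure in the paper.

```python
# reproduce.py -- a hypothetical end-to-end pipeline: raw data in, paper figure out.
# Run with: python reproduce.py
import pandas as pd
import matplotlib.pyplot as plt


def load_raw_data(path="raw_measurements.csv"):
    """Read the rawest data collected, exactly as it came off the instrument."""
    return pd.read_csv(path)


def analyze(df):
    """Every transformation from raw columns to the (x, y) pairs that get plotted."""
    df = df.dropna(subset=["dose", "response"])       # log every exclusion
    summary = df.groupby("dose")["response"].mean()   # the actual computation
    return summary.index.values, summary.values


def make_figure(x, y, out="figure1.pdf"):
    """Generate the exact figure that appears in the paper."""
    plt.plot(x, y, marker="o")
    plt.xlabel("dose")
    plt.ylabel("mean response")
    plt.savefig(out)


if __name__ == "__main__":
    x, y = analyze(load_raw_data())
    make_figure(x, y)
```

A reviewer shouldn’t have to guess where the xs and ys came from; they should be able to run one command and watch the figure regenerate.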
Replication, on the other hand, is far harder to pin down. Let me give the full quote of Nosek et al.’s definition,
“Replication seems straightforward—do the same study again and see if the same outcome recurs—but it is not easy to determine what counts as the same study or same outcome.”
In their definition of replication, they tell you that it’s hard to define. I love it. The very next sentence states, “There is no such thing as an exact replication.” And here lies the rub. I can tell you what an exact reproduction is. It’s running the same code on the same input twice. Demonstrating reproduction just requires sharing data and code. But even metascientific experts can’t tell you what exact replication is. You can’t repeat the same experiment with the same researchers with the same experimental subjects at the same chronological time. Something has to change.
And here lies the importance of the theory that Meehl hammers over and over again in his lectures. Jump back to when we were discussing derivation chains. We had a theory, auxiliaries, instruments, ceteris paribus assertions, and experimental conditions. The last three are always different in two different experiments. The question is how different they can be while the predictions still come out the “same.”
If a theory predicts experimental uncertainty, then we don’t expect the “same” outcome even if we had a time machine that allowed us to repeat the same experiment. If my nomological derivation chain makes a probabilistic prediction, even in the fake reality where I can frequentist-style repeat my experiment ad infinitum, I’ll see different outcomes in different realizations. If you have a shaky ceteris paribus clause or very complicated experimental conditions, you add even more uncertainty to ever seeing the same result. We can only vaguely justify repeatability. While reproductions can and should be exact, replications cannot be.
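A toy simulation makes the point. Assume nothing beyond a theory that predicts a 60% success rate over 50 trials per experiment; the numbers are made up for illustration. Even “exact” repetitions of this probabilistic experiment disagree with one another.

```python
# Toy illustration: exact repetitions of a probabilistic experiment still disagree.
# The theory (hypothetically) predicts a 60% success rate over 50 trials per experiment.
import numpy as np

rng = np.random.default_rng(0)
predicted_rate = 0.6
trials_per_experiment = 50

for i in range(5):
    successes = rng.binomial(trials_per_experiment, predicted_rate)
    print(f"repetition {i + 1}: observed rate = {successes / trials_per_experiment:.2f}")

# Five "identical" experiments, five different observed rates --
# and that's before anything changes in the auxiliaries or ceteris paribus clauses.
```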
The distinction between reproduction and replication is so important that we should be pedantic sticklers for the jargon. I don’t want to call anyone out, but do a search for “reproducibility crisis,” and you’ll find a lot of people talking about replication. Reporters and many scientists use the words interchangeably. They are not interchangeable. Replication and reproduction are asking for such incomparable things that we need two terms.
There is no “crisis” in reproducibility. We have a reproduction problem because people don’t always produce the most usable pipelines (I got in a bunch of trouble pointing this out in January). But a reproduction problem is so easy to fix. We just have to hold ourselves to a higher standard when communicating data. We could solve the “reproducibility problem” by the end of business today.
The replication crisis is a whole other matter. In my predictably contrarian opinion, the replication crisis is overblown. Yes, studies in many disciplines fail to replicate. But studies have always been hard to replicate. Some of the most important insights follow when thinking about why a replication failed. A failure to replicate can be more informative than a successful replication. Failed replication leads to fights about theoretical derivation chains and refinements or augmentations of theory. Lakatosian Warfare progresses because there are no perfect replications. It’s through the contradictions and conflicts that research programs move forward. Failure to replicate is core to scientific advancement.
Don’t get me wrong, there’s a problem if a field spends decades producing results that are fragile to the slightest change in experimental conditions. In such fields, I worry that contemporary metascience spends too much time arguing that better methods will fix the problems. It’s pretty clear from the outside that if a field spins itself in circles failing to replicate experiments, “science” won’t solve this community’s problems. Call me crazy, but some parts of the world can’t be mathematicized or sciencified. I’ll expand on this argument in a future post.
Bringing this back to the Lectures, Meehl makes similar claims. Replication for Meehl is a means to deal with the variability of experimental conditions and the ceteris paribus clause. If a theory yields good predictions for a variety of experimental conditions and under loose ceteris paribus restrictions, then it’s a Salmonian “Damned Strange Coincidence.” These theories get corroborated.
By contrast, reproduction is a means to better understand scientific data. If we’re only given p-values and lumped statistics, how can we answer skeptical questions about the data analysis? Reproduction also makes it possible for other scientists to detect analysis bugs. As I’ve hammered, contemporary scientific derivation chains are deeply dependent on software validity. Making it possible for others to find coding bugs is thus critical and straightforward.
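Here is a sketch of what sharing buys a reviewer. If a paper reports only “p < 0.05,” there is nothing to check; if it ships the data and the analysis, the number can be recomputed and coding bugs can be found. The file and column names below (released_data.csv, group, outcome) are hypothetical, and a two-sample t-test stands in for whatever analysis a paper actually ran.

```python
# Hypothetical check a reviewer could run if the data and analysis were shared,
# instead of trusting a reported "p < 0.05". File and column names are made up.
import pandas as pd
from scipy import stats

df = pd.read_csv("released_data.csv")   # the data the authors actually analyzed
treated = df.loc[df["group"] == "treatment", "outcome"]
control = df.loc[df["group"] == "control", "outcome"]

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # does this match the paper?
```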
In his lecture, Meehl explicitly argues for replication but only implicitly argues for reproduction. To Meehl’s credit, it was much harder to produce a git repository in 1989, as git wouldn’t be invented until 2005. Enforcing reproducibility was somewhat challenging in 1989. It is trivial now. Failures of reproduction today are inexcusable. If journals and conferences required registration of quality code and data, reproduction failures would never happen. The main findings of a paper should follow from a processing chain that runs from the data to the plots and text. That chain should be clear and trivial for peer reviewers to reproduce mechanically. Reproducibility is such a minor request, and fields only harm themselves by resisting it.
I wouldn't say that reproducibility is as trivial as you claim. The whole nix ecosystem was created because Eelco Dolstra's thesis[1] showed that reliably reproducing even the software, to say nothing of the data, requires cryptographic naming conventions and a functionally pure (non-side-effecting, compositional) build system, which is extremely hard to do. I've talked to some highly accomplished people who say that the idea of nix is beautiful and pure, but not workable in practice. As far as I understand it, that's the reason why Docker is the preferred model over nix.
For a more timely and relevant example, there's huggingface's model reproduction pipeline - they just keep a frozen python script that builds and trains the model. Bugs are not allowed to be fixed, and it relies on everything that was available at time t0 always being available under the same name, which is not always so. No matter how hard you try, someone will fix some problem, and replace the old broken thing with some new thing of the same name. That's often a good idea, but it flies in the face of any claim that reproducibility is trivial.
And then there's data... In theory, data is just as hard or just as easy to replicate as software - they're both just digital artifacts. But. You'll never have to worry about HIPAA or PII or deanonymization problems with pure software, but those can be problems with data.
That said, I think the talk of reproducibility being trivial is a distraction from the more interesting point of this article, which really only gets one paragraph: that failure to replicate is interesting in itself. I was hoping for a whole article on that topic, which I think you have a lot more interesting thoughts on, and which I wanted to hear!
[1] https://www.semanticscholar.org/paper/The-purely-functional-software-deployment-model-Dolstra/7c9d53d567c4db2034d8019ff11e0eb623fe2142
See also "Problems With Existing Solutions" at
https://jonathanlorimer.dev/posts/nix-thesis.html
The link to Nosek et al. is broken. This one (https://www.annualreviews.org/content/journals/10.1146/annurev-psych-020821-114157) takes us to the correct page.