10 Comments
Liam Baldwin:

> “A ‘valid’ preregistration plan necessitates knowing the outcome of all aspects of an experiment before conducting it. Preregistration makes it impossible to adapt to the actuality of experimental conditions.”

This is definitely a large drawback of preregistration, but isn't it just a constraint introduced by frequentist assumptions (of which I recognize you're generally skeptical)? It would be nice to learn as you go from the data and adapt your methods appropriately, but doing so does render hypothesis tests useless/biased. Given that this seems to be what researchers are currently doing, a mechanism that weakly enforces concordance with a priori design and hypothesizing seems reasonable.

How else should we get around this?
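
To make that bias concrete, here is a minimal sketch (illustrative numbers only, nothing from the post) of the "learn as you go" problem: under a true null, peek at a t-test after every batch of data and stop as soon as p < 0.05. The nominal 5% error rate inflates well past 5%.

```python
# Toy simulation of optional stopping under a true null effect.
# All sizes here (batches of 20, up to 10 peeks) are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, batch_size, max_peeks, alpha = 2000, 20, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(max_peeks):
        # collect another batch; the true mean really is zero
        data = np.concatenate([data, rng.normal(0.0, 1.0, batch_size)])
        if stats.ttest_1samp(data, 0.0).pvalue < alpha:
            false_positives += 1  # "adapt to the data": stop at the first significant peek
            break

print(f"nominal alpha = {alpha}")
print(f"realized false-positive rate = {false_positives / n_sims:.3f}")
# With ten peeks the realized rate lands well above 0.05, which is exactly
# the bias that adaptive analysis introduces into a naive hypothesis test.
```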

Ben Recht:

Sure, I have written about hypothesis tests before, and I think they are useless and biased for inference but a powerful tool for rulemaking and regulation:

https://arxiv.org/abs/2501.03457

With regard to papers, NHSTs are just there as part of the rulebook for getting published. I think preregistration is a poor rule: it enforces excessive rigidity, it requires a sort of mystical foresight about how experiments will play out, and, since no one wants to run an experiment that doesn't pass the test, it creates perverse incentives in design.

I personally think that the other open science innovations in terms of software and code sharing are more valuable, as communities can then create a deeper inferential digestion of their work.

Alex Holcombe:

I see a lot of posts like yours written as if science reformers are trying to mandate preregistration, but as we science reformers try to emphasize, "a preregistration is a plan, not a prison!" The motivation for this slogan is that, as you wrote in the post, preregistrations often cannot anticipate where a study will take one. The idea is therefore not to prevent exploration and pivoting based on what one finds along the way, but to report transparently, so that readers can evaluate how much the analyses that were done were contingent on seeing the data (which can elevate the rate of false discoveries). Some additional nuance: for medical clinical trials, preregistration has been mandated by all the leading journals since 2005, which predates the "science reform" movement.

Ben Recht:

This comment deserves a longer response from me, and I'll add it to my to-do list. But let me give it an attempt as bullet points:

- I think metascientists really need to grapple with the fact that science and drug trials are different things and chase different ends.

- I'm fine with the vague notion of teaching preregistration as a way to solidify one's thinking before blind experimentation. The problem arises when you move from rule of thumb to strong recommendation.

- If you are going to introduce a checklist, even one that's not mandatory but only strongly recommended by every major journal, you need to make sure we're not wasting people's time or creating new bad practices. If your checklist prevents all deaths from infection in a hospital, that's a strong case. We have nothing remotely close to this for preregistration. Instead, we get a lot of misplaced aspirational language and a lot of unjustified faith in frequentist inference.

- Having been in the middle of the data-use/adaptivity/overfitting crisis in machine learning, let me warn that a lot of statistical facts are just dogma. Looking at data twice is a lot less harmful than the frequentists might have you believe.
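
To give a sense of the magnitude (a toy calculation, not a result from the benchmarking literature): suppose a community "looks at" a shared holdout set over and over by comparing many models on it. Picking the best of k chance-level models on n test points inflates the measured score only on the order of sqrt(log k / n), which stays small for realistic test-set sizes.

```python
# Toy estimate of adaptivity bias from reusing a holdout set.
# n_test and n_models are made-up illustrative values.
import numpy as np

rng = np.random.default_rng(0)
n_test = 10_000     # size of the shared, repeatedly reused test set
n_models = 1_000    # number of models (i.e., "peeks") compared on it

labels = rng.integers(0, 2, n_test)               # binary ground truth
preds = rng.integers(0, 2, (n_models, n_test))    # every model is truly chance-level
accs = (preds == labels).mean(axis=1)             # measured test accuracies

print("true accuracy of every model: 0.500")
print(f"best measured accuracy after {n_models} peeks: {accs.max():.3f}")
# Even a thousand adaptive looks at a 10k-point test set buy only one or two
# points of spurious accuracy -- real, but far from catastrophic.
```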

Alex Holcombe:

About your last point: your writings on that are why I subscribed to the blog, so thanks, and it would be great to see you apply the lessons from the overfitting crisis to the question of whether we should stop worrying about p-hacking the way some of us do.

About "metascientists really need to grapple with the fact that science and drug trials are different things and chase different ends": indeed, I was drawing a contrast between science reform and drug trials (sorry if that wasn't clear). Preregistration is required for such RCTs but, to my knowledge, nowhere else, and I don't think anyone in the science reform movement has mandated it, appropriately in my view. As I mentioned, I see a lot of posts on social media railing against mandates, but unfortunately the rants stay at a general level without citing specific examples. This makes me think people hold straw-man views of each other, whereas if they were discussing a specific proposal for a checklist or requirement in a specific area, they might agree. So I want to see both sides engage with very specific scenarios.

Kevin Munger:

Finally got a chance to read this one -- I love the framing. I'd only add the critique that the term validity is /binary/, and that this means internal and external (temporal) validity are somewhat different things. From Drew's and my paper:

> Translating this intuition to statistical practice within social science, we might say that an epistemic community that puts unbiasedness over precision will get neither. Assumptions are unavoidable. Proceeding from this premise, we aim to reframe the discussion of extrapolation away from "validity" entirely. This word has the unfortunate implication of being binary; computer login passwords and driver’s licenses are either valid or invalid. To say that a password is “mostly valid" is to say that it is “not valid." Scientific knowledge is not binary, and while most practitioners can successfully keep this reality in mind when discussing “external validity," the term introduces unnecessary confusion.

https://osf.io/preprints/osf/nm7zr_v2
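
A small numerical aside on the unbiasedness-versus-precision point (a toy example, not anything from the paper): a deliberately biased shrinkage estimator of a mean can beat the unbiased sample mean on mean squared error when samples are scarce.

```python
# Toy comparison of an unbiased estimator and a biased shrinkage estimator.
# true_mean, sigma, and n are made-up illustrative values.
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, n_sims = 1.0, 5.0, 10, 100_000

samples = rng.normal(true_mean, sigma, (n_sims, n))
unbiased = samples.mean(axis=1)   # classical unbiased sample mean
shrunk = 0.5 * unbiased           # biased on purpose: shrink halfway toward zero

mse = lambda est: np.mean((est - true_mean) ** 2)
print(f"MSE of unbiased sample mean:       {mse(unbiased):.3f}")
print(f"MSE of biased shrinkage estimator: {mse(shrunk):.3f}")
# The shrinkage estimator accepts a little bias in exchange for much lower
# variance and wins on total error: putting unbiasedness over precision loses both.
```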

Ben Recht:

100% endorse. But this means we need a new term of art like "verisimilitude" for validity.

Sam:

Thanks for sharing this—insightful read! I would be curious to hear your positions on the recent influx of AI scientist systems like Sakana, Intology, and AutoScience working on near end-to-end automation of scientific discovery and paper writing for AI venues. What validity concerns do you have about these systems? Are there parts of the scientific research, peer review, and dissemination process where these systems might actually enhance internal, external, or construct validity evaluations?

Ben Recht:

I mostly ignore that body of work because every time I look, it ends up being snake oil.

But maybe you have a thought about how these systems could be used productively? I'm just saying that in all of the examples I've seen (on Twitter), it's been a lot of easily debunkable hot air.

Mark Johnson:

I am surprised that incremental hill-climbing experimental methodology is as successful and productive as it is in NLP and ML. But even though Sutton's Bitter Lesson seems a fairly accurate description of the field, it doesn't explain why Deep Learning succeeded where earlier approaches failed.

As far as specifying a recipe for good science goes, I think the recent XKCD (https://m.xkcd.com/3101/) is pretty good. I like the mouse-over text: "If you think curiosity without rigor is bad, you should see rigor without curiosity."
