Science is correlations with stories. Validity is the style guide for these stories. It is the acceptable rhetoric that scientific stories can use. Validity constrains how we talk about empirical results, forcing scientists to use a particular form of argumentation to make their case. Validity provides rules for what makes a story “good” science.
As it is typically taught, scientific validity is immediately broken down into two disjoint components: internal validity and external validity. A scientific study is internally valid if the evidence presented supports its main claims. It is externally valid if other scientists can apply its results in other settings. Though the term validity carries heavy rhetorical baggage of truthfulness, you can see that what exactly we mean by “support” or “apply” is necessarily hard to pin down. And hence we have to rely on community expectations for working definitions.
You can see disciplinary norms most clearly when you dig into internal validity. Definitively showing that a scientific observation corroborates a hypothesis is a lofty goal. In practice, internal validity is reduced to a checklist. Were there obvious flaws in the experimental setup? Were there variables that the authors didn’t control for? Did the authors train on the test set?
This checklist might be neither necessary nor sufficient for inferring good science. The community settled on this particular list because, at some time in the past, someone skipped one of these steps and bad science happened. For example, unblinding in a randomized trial might threaten internal validity: knowing which group a subject was in could influence the subject’s or trialist’s behavior in the experiment. Hence, we decided that preserving blinding in randomized trials is critical to mitigate bias.
On the other hand, some zealous science reformers argue for conventions with far less obvious necessity. My favorite example is preregistration. For its adherents, preregistration constrains “research degrees of freedom” so that people don’t data dredge and find spurious signals in their data. However, while writing a good experimental plan is worthwhile, preregistration infantilizes science into a woefully naive sequence of randomized controlled experiments. A “valid” preregistration plan necessitates knowing the outcome of all aspects of an experiment before conducting it. Preregistration makes it impossible to adapt to the actuality of experimental conditions.
In machine learning, we have our own internal validity conventions of suspect necessity. Validity was a major topic in our machine learning evaluation class, and realizing that these conventions were just conventions was my strongest takeaway. Taboos like adaptive overfitting to the test set are still taught as part of our rules for internal validity, even as more and more evidence comes in that test-set reuse doesn’t compromise claims the way we thought it did. The upside of these negative examples is that communities can adjust their standards. We can update our rigid validity checklists in light of new empirical data, like good scientists should.
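For readers outside the field, here is a minimal sketch of what the taboo looks like in practice, using a synthetic dataset and an off-the-shelf classifier purely for illustration: the modeler picks a hyperparameter by repeatedly peeking at test accuracy, so the reported number is no longer a clean held-out estimate.

```python
# Hypothetical sketch of adaptive test-set reuse: candidate models are
# selected by peeking at test accuracy, the practice the internal-validity
# checklist warns against.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

best_model, best_test_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:  # candidate hyperparameters
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)  # peeking at the test set
    if test_acc > best_test_acc:
        best_model, best_test_acc = model, test_acc

# The reported "test accuracy" was itself used for model selection,
# so it is no longer an unbiased estimate of generalization error.
print(f"selected C={best_model.C}, adaptively chosen test accuracy={best_test_acc:.3f}")
```

Whether this kind of reuse meaningfully corrupts conclusions at the scale of community benchmarks is exactly the empirical question the new evidence speaks to.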
External validity concerns whether results in one study apply to another study. In other words, are experiments reproducible in different contexts? The stories of external validity concern just how much we expect a result to hold as context changes. Which parts of a logical relationship should stay the same, and which parts may differ across settings? These stories are useful because they facilitate consensus about the mechanisms that truly make an intervention effective. If we accept that an original experiment was done in good faith, tests of external validity probe which parts of the experimental setup are needed to yield a similar result. How we describe the difference between a replication context and the original context is another set of stories.
Finally, there is construct validity, which is far more complex and more challenging to operationalize. It is the stories we tell about our interventions and measurements themselves. How do empirical artifacts connect to abstract concepts? There are tests of construct validity, but these tests necessarily rely on the specific expectations of the relevant scientific disciplines. The constructs of psychology differ from those in cell membrane biology, even though they both pertain to human systems.
These myriad facets of validity not only provide a style guide for writing but also a framework for critique.1 Validity trains scientists to be maximally skeptical: to look for flaws in the experimental setup, for violations of ceteris paribus assertions, for failures of generalizability.
It’s critical for students to learn this! It’s how disciplines progress, for better or for worse. The social culture of research proceeds by finding bugs in the theoretical grounding of past results and doing new studies to patch them. Scientists have to be trained to find bugs, and hence any course on methods and evaluation must cover the norms for what counts as a bug in the first place. Peer review, in its broadest, most general form, is conducted in the domain-specific language of validity.
Internal validity is how you poke at people’s experimental setups, their statistical methodology, and the inherent biases in their designs. Construct validity attacks the meaning and interpretation of the experimental interventions and outcomes themselves. External validity supplies the critiques of generalizability.
It’s helpful to have these scientific rules written out so everyone can agree upon a baseline for scientific play. We can (and do!) change the rules if they are preventing us from telling useful stories about correlations. At its best, validity enables improvisational creative work where scientists push against the boundaries of empirical techniques and serendipitously find new facts. At its worst, validity handcuffs researchers into dribbling out incremental results for CV padding. Like all academic systems, the rules of the validity game change slowly.
Addendum: Jordan Ellenberg wrote to me with a necessary rejoinder to this post: “I think you are unnecessarily harsh about dribbling out incremental results, which in my view is the fertilizer of science, both in the sense that it promotes growth and in the sense that some people see it as just a pile of shit.” He’s right! I need to figure out how to rephrase the second-to-last sentence here. To be continued…
1. Validity provides rules not only for Lakatosian Defense but also for Lakatosian Offense.
> “A ‘valid’ preregistration plan necessitates knowing the outcome of all aspects of an experiment before conducting it. Preregistration makes it impossible to adapt to the actuality of experimental conditions.”
This is definitely a large drawback of preregistration, but isn’t it just a constraint introduced by frequentist assumptions (of which I recognize you’re generally skeptical)? It would be nice to learn as you go from data, and adapt your methods appropriately, but this does render hypothesis tests useless/biased. Given that this seems to be what researchers are currently doing, a mechanism that weakly enforces concordance with a priori design/hypothesizing seems reasonable.
How else should we get around this?
Thanks for sharing this—insightful read! I would be curious to hear your positions on the recent influx of AI scientist systems like Sakana, Intology, and AutoScience working on near end-to-end automation of scientific discovery and paper writing for AI venues. What validity concerns do you have about these systems? Are there parts of the scientific research, peer review, and dissemination process where these systems might actually enhance internal, external, or construct validity evaluations?