This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.
In Lecture 8, Meehl puts forward a few suggestions on how to make the scientific literature more interpretable. He is not proposing ways to make research “better.” We’ll never reach a place where we all agree; if we didn’t disagree, we wouldn’t be doing science in the first place. But the obfuscators Meehl listed in Lectures 6 and 7 made it so that the social scientific literature provided almost no information about what’s true and false. That means we can’t even have sensible arguments. Meehl wants the literature to better inform arguments about theories and beliefs about the effectiveness of treatments. His suggestions in Lecture 8 are meant to make Lakatosian warfare possible.
I’ve boiled Meehl’s suggestions down to three main themes.
Improving reproducibility
Moving beyond hypothesis testing
Publishing less
Long-time readers will note that I have already blogged favorably about those three suggestions. Is that because I’m hearing what I want to hear when I listen to Meehl? Or is that because he was screaming into the void in the 80s, and we should have listened to him then? Could it be both? Let’s all buckle up for the next three posts to find out! I want to explore which of Meehl’s suggestions were heeded and what we might do to further implement them today.
Improving reproducibility
Meehl was banging the drum for better reproducibility earlier than most. He suggests that investigators should be required, or at least strongly encouraged, to replicate their own results. They could publish the results of a pilot study alongside the main study. A study would be more compelling if it provided two independent measurements of the same effect on two dissimilar datasets. Requiring two measurements would necessarily set a higher bar for publication, but it would also yield considerably stronger evidence for the tested theory.
While asking for more experiments is a high bar, asking for more information is not. That’s Meehl’s second suggestion: journals should require authors to provide more information. It’s amazing to hear the sorts of practices that seemed to be allowed in the 1980s. Meehl claims that people could report “significant at some level” and never tell you the mean difference between the groups. That seems preposterous. It’s not much better to report the mean difference with only an asterisk denoting significance, yet this was also commonplace in social science. The reader would never see the standard errors or the p-values. I’d be curious to hear from folks in psychology how common these practices remain.
It’s so cut and dried that this shouldn’t be allowed. If you’re going to run a hypothesis test, why not report everything? Say what the test is. State the test statistic. State the standard error. State the p-value. Give a confidence interval. Meehl is right that you can compute any of these numbers from the others, but then you should at least report one of them to high enough precision to do so. And why force a reader to open R or Python? Just list them all.
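To make that concrete, here’s a minimal sketch of what “report everything” could look like for a two-group comparison. The data below are simulated placeholders, not results from any study; the point is that the mean difference, standard error, test statistic, p-value, and confidence interval together cost a few lines.

```python
# A minimal sketch of "report everything" for a two-group comparison.
# The data below are simulated placeholders, not results from any study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=0.5, scale=1.0, size=50)  # hypothetical outcomes
control = rng.normal(loc=0.0, scale=1.0, size=50)

# Mean difference and its standard error (Welch, unequal variances).
diff = treatment.mean() - control.mean()
v_t = treatment.var(ddof=1) / len(treatment)
v_c = control.var(ddof=1) / len(control)
se = np.sqrt(v_t + v_c)

# Welch's t-test gives the test statistic and the p-value.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# 95% confidence interval using the Welch-Satterthwaite degrees of freedom.
dof = (v_t + v_c) ** 2 / (v_t**2 / (len(treatment) - 1) + v_c**2 / (len(control) - 1))
half_width = stats.t.ppf(0.975, dof) * se

print(f"mean difference = {diff:.3f}, SE = {se:.3f}")
print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI = [{diff - half_width:.3f}, {diff + half_width:.3f}]")
```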
Beyond this, Meehl says that papers should give a sense of the shape of the distributions of the two groups. Pictures are probably even more informative than the raw statistics. We know that most natural phenomena are not really Gaussian and linear. Investigators should plot the histograms so that people can understand group overlap and skew. Visual statistics are far more compelling and informative than test statistics.
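As a rough illustration with simulated, deliberately skewed data, overlaid histograms of the two groups show overlap and skew in a way no asterisk can:

```python
# A sketch: overlaid histograms of the two groups, with deliberately skewed
# simulated data, so overlap and skew are visible at a glance.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
treatment = rng.gamma(shape=2.0, scale=1.2, size=50)  # hypothetical, skewed
control = rng.gamma(shape=2.0, scale=1.0, size=50)

# Shared bin edges make the two histograms directly comparable.
bins = np.histogram_bin_edges(np.concatenate([treatment, control]), bins=20)
plt.hist(control, bins=bins, alpha=0.5, label="control")
plt.hist(treatment, bins=bins, alpha=0.5, label="treatment")
plt.xlabel("outcome")
plt.ylabel("count")
plt.legend()
plt.show()
```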
Meehl additionally argues that investigators should be required to measure and report nuisance variables that they don’t think should be causally affected by the treatment. If the measured effect size is on the same order as the differences in those nuisance variables, that’s a sign the study failed to corroborate the investigator’s theory. Here, I’d argue we’re in a better state now than in the 1980s. Most epidemiology papers I’ve looked at have extensive tables of variables comparing the different groups, and they tend to list the p-values associated with the group differences. Economics papers now come with hundreds of pages of sensitivity and robustness checks to validate their causal claims. Papers come with all sorts of pretty plots. This is all a step in the right direction.
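For a sense of how little this asks, here’s a sketch of such a nuisance-variable table. The variable names (age, baseline_score) and the data are hypothetical placeholders; a balance table is a few lines of pandas, not a burden.

```python
# A sketch of a nuisance-variable ("balance") table. The variables (age,
# baseline_score) and the data are hypothetical placeholders.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": rng.choice(["treatment", "control"], size=200),
    "age": rng.normal(45, 12, size=200),
    "baseline_score": rng.normal(100, 15, size=200),
})

rows = []
for var in ["age", "baseline_score"]:
    t = df.loc[df["group"] == "treatment", var]
    c = df.loc[df["group"] == "control", var]
    _, p = stats.ttest_ind(t, c, equal_var=False)
    rows.append({"variable": var,
                 "treatment_mean": t.mean(),
                 "control_mean": c.mean(),
                 "p_value": p})

print(pd.DataFrame(rows).round(3))
```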
But these reporting improvements are still not enough for replication. Meehl is asking for as much information as possible in a paper. Why not take this to its logical limit? In 2024, there is no excuse for papers not to come as git repositories. Every paper should include a repository of readable, runnable, commented code and as much data as possible. Ideally, this repository should trace all steps from data extraction to statistical analysis. The data should be in its most primitive, unaltered state. This way, the interested reader can view the data from whatever angle they want. The authors of the paper can make their argument about what we should see, but everyone else should be able to run their own analyses.
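To sketch what I mean (a hypothetical layout, not any standard), such a repository might look like:

```
paper-repo/
├── README.md          # how to reproduce every figure and table
├── data/
│   └── raw/           # the primitive, unaltered data (or a script to fetch it)
├── code/
│   ├── 01_extract.py  # data extraction and cleaning
│   ├── 02_analysis.py # statistical analysis
│   └── 03_figures.py  # every figure in the paper
└── environment.yml    # pinned dependencies
```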
I’d argue that we should make the papers themselves shorter! I don’t want to flip through people’s robustness analyses in an endless pdf file. I’m not sure why anyone puts up with these appendices. I mean, at this point, don’t we all think it’s odd that robustness analyses always come back in the author’s favor? There’s a reasonable alternative to such exhaustive sensitivity analyses: just give out your code and data so the skeptic can see what’s under the hood. And if investigators were really committed to their robustness checks, they could include them in their repository as nice interactive notebooks. I’m all for it.
There are no good arguments against this sort of reproducibility. Certainly, “proprietary data” is an absurd argument. If your data is proprietary, I don’t believe your results. You are trying to sell me something, so no paper for you.
A trickier argument is made in medical research: data can’t be released because “privacy.” This argument derives from a mindless, shallow reading of the Belmont Report. I fully endorse that respect for persons and beneficence dictate that investigators respect people’s desire for privacy in studies. But how real are the privacy concerns behind revealing counts in randomized trials? Why can you request the data from drug trials from the FDA but not from device trials? Why are other clinical trials or random EHR data mining exercises impossible to access? Does it actually benefit the patients in a study that we can’t check the investigators’ work? Does protecting privacy outweigh the risk that closed data hides fraud? We should discuss these questions seriously and in depth.
Now, I’m actually optimistic here. One of the few good things to come out of the international covid response was a broader embrace of preprint servers by the human-facing sciences. If medicine can embrace preprints, it can embrace code sharing and open data too. The future of scientific publication must bend towards open repositories. We’re on the right track, but let’s keep pressuring our colleagues to keep moving in the right direction.
Loose ends
Meehl starts off the lecture with modest advice that is so uncontroversial that it’s astounding it’s still often not taken. Though aimed at observational studies, these suggestions should also apply to every randomized trial or other interventional experiment. First, every investigation should begin with an estimate of the effect size needed to strongly corroborate the proposed theory. A mere directional prediction is far too weak. Second, studies should be powered at the 90% level to detect this effect. Third, that power calculation should be explicitly written down.
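The third point, in particular, costs almost nothing. Here’s a minimal sketch of an explicit power calculation using statsmodels, with a hypothetical effect size standing in for whatever the theory actually predicts:

```python
# A sketch of an explicit, written-down power calculation. The effect size
# here is a hypothetical placeholder, not a recommendation.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.4  # standardized effect (Cohen's d) the theory predicts
analysis = TTestIndPower()

# Sample size per group to detect that effect with 90% power at alpha = 0.05.
n_per_group = analysis.solve_power(effect_size=effect_size, power=0.90, alpha=0.05)
print(f"required n per group: {n_per_group:.1f}")
```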
This is all perfectly reasonable, and I’m sure almost every methods class teaches something along these lines. And yet I found a bunch of violations of these principles in a cursory glance at my Zotero this morning. Though power concerns used to bug me, I’ve become more relaxed about them over time. These particular suggestions are just lipstick on the hypothesis-testing pig. Patched-up hypothesis testing is still just hypothesis testing. Hypothesis testing is the problem! That’s probably why Meehl doesn’t dwell too deeply on it, and that’s why I’ve relegated the discussion to these loose ends.
Lipstick on a pig is right. Or maybe deck chairs on the Titanic.
Refining the ritual of science still means treating science as ritual.