This post digs into Lecture 6 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.
In Lectures 6 and 7, Meehl dives into what he’s best known for: a critique of observational data analysis based on null hypothesis testing. These lectures draw from his paper “Why Summaries of Research on Psychological Theories are Often Uninterpretable,” where he lists ten obfuscators that make it hard to assess published results. Why is it that no matter how many studies we do, we can come away from a meta-analysis with no idea whether a particular theory is true or false? Why do we end up with equivocal results stating “we conclude with low to moderate confidence that some factor may or may not have any relevance to the outcome in question”? Why does so much of science just amount to wasting everyone’s time?
Meehl makes it very clear that he is only speaking about observational studies, not randomized experiments. He remarks that his colleague David Lykken thinks the criticisms do indeed also apply to randomized experiments. For whatever it’s worth, Meehl thinks Lykken might be right, and I do too. Meehl wasn’t ready to drop the hammer just yet. However, Meehl’s critiques do apply to all “observational studies,” even the ones that use fancy statistics to pretend like they did a randomized experiment (this sort of fancy stats is what economists and their friends arrogantly call “causal inference”).
I’ve been meaning to write about the poverty of observational causal inference since I started this substack, and I thank Meehl for finally presenting me with the opportunity. Let’s spend a couple of weeks on why you shouldn’t believe any observational studies. We can then move on to see which of the critiques also apply to RCTs. And then next year, we can work on closing down the economics department.
Meehl’s obfuscators break down into four clean groups. The first four obfuscators are about derivation chains, the next three are about statistical correlations, the next two about research bias, and the final one about construct validity. Today, we can tackle the first four as they are a nice segue from the lecture on Lakatosian Retreat. And these obfuscators highlight a piece that was missing from the program. When your derivation chain from theory to outcome is not logically tight, then no amount of evidence can corroborate or refute your theory.
Recall, one last time, Meehl’s logical formula for scientific prediction:
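(T ∧ AT ∧ AI ∧ CP ∧ CN) ⊢ (O1 → O2)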
In this model, we logically deduce a prediction that “If we see O1, then we see O2” from our theory T, our auxiliary theories AT, our instrumental auxiliaries AI, the ceteris paribus clause CP, and the experimental conditions CN. If we do an experiment and see O2 after O1, and if the likelihood we would have seen O2 after O1 absent the theory is small, our theory is corroborated.
But this all rests on the deduction chain being valid. Another way to attack a scientific result is to go after the particulars of the derivation chain. Meehl’s first four obfuscators apply all too widely:
The deduction chains are not explicit.
There are problematic auxiliary theories (unstated or badly formalized AT).
The ceteris paribus clause is almost surely false (easily deniable CP).
The particulars are imperfectly realized (murky CN).
That any one of these four is a showstopper for corroborating a theory should be clear. If any of them holds, then the deduction chain does not logically imply that O2 follows from O1. A poorly described, poorly justified derivation chain combined with correlational evidence mined from some public data set doesn’t corroborate anything. When your theory is a murky mess, its negation is also a murky mess, and we just end up confused.
The first two obfuscators are saying that theories fail “robustness checks.” We end up in this silly game where authors write down a loose mathematical model for why O2 follows from O1, but don’t have a convincing reason for why that should be the relationship. They might say O2 corresponds to a measured quantity y. They quantify O1 in a covariate x. They write down an equation

y = bx + e

where b is some parameter and e is an error signal. Then they assert e has some statistical properties, like being normally distributed. This model makes a bunch of huge leaps. Why is it linear? Why is that noise random? Why aren’t there other variables in the equation? How many specifications could be close enough to this theory while still being plausible?
Unfortunately, most published observational studies have this problem. There are deductive leaps from the theory to the model, the model is never valid, and there are dozens of plausible models that are just as good as the one written down in the appendix. This means the probability of O2 given O1 in the absence of the theory can change wildly depending on this specification. As a viral Twitter thread showed yesterday, most observational results in prestigious academic journals don’t pass modest robustness checks. And yet we keep publishing them.
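To make the specification-sensitivity point concrete, here’s a minimal sketch in Python with simulated data. The data-generating process, the variable names, and the candidate specifications are all made up for illustration; they don’t come from Meehl or any particular study. The same x and y yield noticeably different estimates of the “effect” b depending on which controls and functional form you happen to pick.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated world: a lurking variable z drives both the covariate x
# and the outcome y, and y also depends on x nonlinearly.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.3 * x + 0.2 * x**2 + 0.7 * z + rng.normal(size=n)

def coef_on_x(design_columns):
    """Ordinary least squares; return the coefficient on x (first column after the intercept)."""
    X = np.column_stack([np.ones(n)] + design_columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

specs = {
    "y ~ x":           [x],
    "y ~ x + z":       [x, z],
    "y ~ x + x^2":     [x, x**2],
    "y ~ x + x^2 + z": [x, x**2, z],
}

for name, design in specs.items():
    print(f"{name:16s} coefficient on x: {coef_on_x(design):.3f}")
```

None of these specifications is obviously wrong a priori, which is exactly the problem: the reported “effect of x” is partly an artifact of the regression the analyst chose to run.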
Oh well, moving on! In our contemporary language, we can sum up Meehl’s third bullet as: there are always hidden confounders. How can there not be? I find it hilarious when people explicitly state in their papers that they are assuming there are no hidden confounders. I mean, I appreciate the candor, I guess. But I don’t believe them! As Meehl puts it:
“It's really hard to conceive of a thing we do in soft psychology that involves correlational stuff in which you could say with any confidence there isn't any other systematic trait of humans (or any other demographic thing about them like their social class origin, their race, their age, their sex, their religion or their political affiliation) that's going to be a correlate of one of the factors that we're plugging into our design.”
How can you disagree with that? And how can you prove beyond a shadow of a doubt that the things you didn’t randomize and control didn’t cause the observational relationship you are seeing? I’ll say more about this when I discuss the wonderfully named “crud factor.”
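As a toy illustration of why the “no hidden confounders” assumption is doing all the work, here’s a short simulation, again with made-up variables rather than anything from a real study: x has no causal effect on y at all, but an unmeasured trait z moves both, and a naive regression happily reports a strong association.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# An unmeasured trait z (social class, age, whatever) drives both
# the "treatment" x and the outcome y. x has no effect on y at all.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# Naive analysis: regress y on x without knowing about z.
X_naive = np.column_stack([np.ones(n), x])
beta_naive, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

# Analysis that (luckily) measured z and controls for it.
X_adj = np.column_stack([np.ones(n), x, z])
beta_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)

print(f"naive slope on x:    {beta_naive[1]:.3f}")  # around 0.5, looks like an effect
print(f"adjusted slope on x: {beta_adj[1]:.3f}")    # around 0, the truth
```

The observational analyst only ever sees the first number; whether the second one is even computable depends entirely on whether they happened to measure z.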
Finally, there’s the imperfect realization of the particulars. Here you see Meehl decrying the replication crisis in 1989, decades ahead of the crowd. We’ve touched on this before when discussing the context of discovery. Experimenter bias is always a worry and manifests in surprising, unintentional ways. There are always parts of the experiment that don’t explicitly appear in the text. We can partially fix this with reproducibility standards. Sharing data and code pipelines helps a lot. But if you really want to be sure that the experiment is valid as written, you have to reproduce it. As we’ve seen, most published studies are hard to reproduce experimentally.
Meehl argues that these first four obfuscators too often cast doubt on good theories. But how does he propose that scientists approach their research to prevent good theories from being abandoned? By being more rigid and logical? By doing better statistics? In the next posts, I’ll dig into two digressions Meehl takes in Lecture 6 in an attempt to infer his answers to these questions.
>Meehl makes it very clear that he is only speaking about observational studies, not randomized experiments.
But Meehl does extend his view to some randomized experiments in the 1990 paper: he says that any experiment where the target effect hinges on an interaction between the (randomly assigned) treatment and some other covariate that wasn't randomly assigned is essentially observational and the same critiques therefore apply.
Since I am involved in research that is much more qualitative, I often read both observational and controlled studies. I agree that observational studies are highly flawed, but I also think they receive unwarranted hate far too often (while controlled studies receive too much love by comparison). Here’s my argument to hopefully give some love to observational studies. In a socio-political context, there are many actors in different situations, placed in an environment with artifacts and objects that heavily influence their behavior. While all the actors may share similar goals for a study (as economists may assume to simplify their analysis), the means by which they pursue those goals vary greatly and are highly context dependent. Often, the only way to uncover issues and problems in a given social or political context is to do an observational study. Actors with different backgrounds and contexts are placed in an environment with similar artifacts. What do they have to do to get to a goal? How do they do it? And why do they do it the way they do? The only way to answer these questions is to observe the actors. Then, if you can, you talk to them. This is, in my opinion, how we identify problems. Now, say you found the problem and want to modify the environment to address it. You hypothesize that by making this intervention, you may see some changes. Again, in a socio-political world, a controlled study is just not possible. So you make the intervention and observe the changes again. You repeat the observational study and perhaps find some issues with the intervention. Then you rinse and repeat.