This post digs into Lecture 7 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.
Today, let’s breeze through Meehl’s final four obfuscators of observational null hypothesis testing. This can be breezy in part because Meehl has already spoken at length about two of them (here and here): Selective Bias in Submitting Reports and Selective Editorial Bias. I won’t spend more time on those today.
So we’re down to two: Pilot Studies and Detached Validation Claims.
Pilot Studies
Before you invest a ton of money in a major data collection effort, it makes perfect sense to run a baby study to see if there’s any hope of the result panning out. Such pilot studies are where you might test whether your code or device works, get some qualitative feedback on the design, and get a sense of how large the effect of your treatment is. Meehl argues that pilot studies are valuable and likely necessary exercises to nail down the technical foundations of a good experiment. I don’t know anyone who disagrees with this. However, if the outcome relies on null hypothesis testing, pilot studies have a pernicious paradoxical downside.
If the pilot doesn’t pan out, it could be because the pilot is underpowered! Pilot studies are necessarily small. They might be so small that they have a high false negative rate. In advance, you don’t know how large the effect should be, so unless the intervention works without fail, the pilot might yield no statistically significant effect. And since it’s just a pilot, researchers are more willing to file-drawer the finding and move on to their next clever experimental idea.
On the other hand, given the crud factor, false positives will be abundant in pilot studies. And taking power functions seriously, researchers will size their main studies to have enough power to replicate whatever effect the pilot turned up. If that pilot effect was mostly crud, the main study will be large enough to reject the null hypothesis anyway.
This leads to a bit of a nightmare. Good theories are getting screened out at the pilot stage due to insufficient power. Bad theories are getting accepted at the main stage due to crud. If this were the case, random theories with no verisimilitude would be consistently corroborated in published results.
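To make this dynamic concrete, here is a minimal simulation sketch. The effect sizes, sample sizes, and the 5% threshold are illustrative numbers I made up, not anything from Meehl; the point is only to show how a small pilot can screen out a real effect while a big main study happily rejects the null on crud.

```python
# A minimal simulation sketch of the pilot-study dynamic above. All numbers
# (effect sizes, sample sizes, thresholds) are illustrative assumptions of
# mine, not Meehl's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant(effect, n_per_arm, rng, alpha=0.05):
    """Simulate one two-arm comparison with a true standardized mean
    difference `effect` and report whether a t-test rejects the null."""
    treated = rng.normal(effect, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    return stats.ttest_ind(treated, control).pvalue < alpha

n_trials = 2000

# Failure mode 1: a good theory with a real, moderate effect (d = 0.3) faces
# an underpowered pilot of 20 subjects per arm and usually gets file-drawered.
pilot_pass_rate = np.mean([significant(0.3, 20, rng) for _ in range(n_trials)])

# Failure mode 2: a worthless theory rides a small crud effect (d = 0.1). A
# main study with 1600 subjects per arm rejects the null most of the time.
main_pass_rate = np.mean([significant(0.1, 1600, rng) for _ in range(n_trials)])

print(f"Real effect survives the small pilot:    {pilot_pass_rate:.0%}")
print(f"Crud effect survives the big main study: {main_pass_rate:.0%}")
```

The asymmetry is the nightmare in miniature: the real effect rarely clears the small pilot, while the crud effect clears the big main study about as often as a well-powered real finding would.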
Detached Validation Claims
Meehl’s final obfuscator is that we forget that measurements are often very imperfect representations of the treatments and outcomes we aim to study. For example, in psychology, many measurements come from psychometric tests. These tests have several widely acknowledged issues. First, the correlation between the test score and the trait you care about is often low. Test builders might find a Pearson r as low as 0.4 but still deem the test useful enough for some aspects of clinical practice. To make matters worse, test-to-test reliability is far from perfect, with the Pearson r between two versions of the same test, or even two administrations of the same test, being as low as 0.8. This means that test scores are often a weak proxy for the trait you are trying to measure.
This weak correlation is problematic, but it’s even worse when researchers forget it is low. Meehl notes that in psychology, researchers will write in the methods section that “this test was validated in reference 11.” But then they’ll report hypothesis test results on the test scores as if the test had perfect validity and reliability. With the numbers from the example above, the effect on the true trait might be only a fraction of the effect size measured with the imprecise test. That fraction could be as low as 0.1. Few significance tests pan out if you need to divide the z-score by 10.
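To get a feel for the arithmetic, here is a back-of-the-envelope sketch. The combining rule (validity times the square root of reliability, compounded once per imperfectly measured variable) is a crude, classical-test-theory-style simplification of mine, not Meehl’s derivation; it just shows how quickly factors like 0.4 and 0.8 push the claimable effect toward a tenth of the measured one.

```python
# A back-of-the-envelope sketch of the discount above, under a deliberately
# crude linear model of my own (not Meehl's derivation): regressing the trait
# on the test score, a shift in test scores translates into a much smaller
# expected shift in the trait itself.

validity = 0.4      # correlation between the error-free test score and the trait
reliability = 0.8   # correlation between two administrations of the same test

# The observed score tracks the trait at roughly validity * sqrt(reliability),
# so a measured effect of d standard deviations on test scores supports a claim
# of only about this fraction of d on the trait, and the z-score for the
# trait-level claim shrinks by the same factor.
one_test = validity * reliability ** 0.5
print(f"one imperfect test:  discount the effect (and z-score) by {one_test:.2f}")

# If both the predictor and the outcome are measured through tests like this,
# the discounts multiply, landing near the factor of 0.1 mentioned above.
two_tests = one_test ** 2
print(f"two imperfect tests: discount the effect (and z-score) by {two_tests:.2f}")
```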
More broadly, the issue is understanding what a measurement tells you about the outcome you actually care about. Across the board in the human-facing sciences, we are faced with imperfect outcome measurements. A personal favorite of mine is “progression-free survival” (PFS) in cancer studies: no one seems to know what that outcome means for the health and well-being of a patient, but you can get drugs approved if you can improve it.
Though Meehl was reluctant to argue against them, the final four obfuscators also plague randomized trials and other interventional experimental designs. In fact, seven of Meehl’s ten obfuscators are major issues for experiments in general. I could make the case that many randomized experiments are plagued by problematic ceteris paribus assertions, experimenter error, insufficient power, incorrect conclusions from pilot studies, selective bias in submitting reports, selective editorial bias, and detached validation claims. I could also argue that the first two of Meehl’s obfuscators, loose derivation chains and problematic auxiliary theories, lead to poor experimental design choices and poor statistical analyses in interventional studies. So that’s 9 out of 10? Could I even make the case that crud might oddly impact randomized trials? Was Lykken right? Yikes. I’ll definitely come back to this.
But first, let me stick with observational studies. Given his long list of obfuscators, Meehl leaves us asking what we should do. One could argue “stop null hypothesis testing,” but no one wants to go as far as “shutter all quantitative research in social science.” In Lecture 8, Meehl proposes some fixes. It’s interesting to see which have been adopted, which remain untried, and which have had a positive impact. In the next few blogs, I’ll not only talk through Meehl’s suggestions but also propose a few of my own.