This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” I’m taking a brief interlude from my run-down of Lecture 7. Here’s the full table of contents of my blogging through the class.
Meehl’s course has already emphasized that significance testing is a very weak form of theory corroboration. Testing if some correlation is non-zero is very different from the earlier examples in the course. Saying “it will rain in April” is much less compelling than predicting next year’s precise daily rainfall in a specific city. It’s frankly less compelling than predicting a numerical value of the pressure of a gas from its volume and temperature. I’m a bit reluctant to plead for a “better” form of significance testing. Part of the issue with the human-facing sciences is the obsession with reducing all cause and effect, all experimental evidence, to Fisher’s exact test. Randomized controlled experiments are a particular experiment design, not the only experiment design. Someday, we’ll all break free from this bizarre, simplistic epistemology.
But that won’t be today. Let me ask something incremental rather than revolutionary for a moment. What would null hypothesis significance testing look like if we took crud seriously? We know the standard null hypothesis (i.e., that the means of two groups are equal) is never true. What seems to be true is that if we draw two random variables out of a pot, they will be surprisingly correlated. If that’s true, what should we test?
Here’s a crudded-up null hypothesis:
H0: Someone sampled your two variables X and Y from the crud distribution.
We could ask what the probability of seeing our recorded correlation would be if H0 were true. What would the test look like? We’d need to compute a distribution of the potentially observed Pearson r values. Since we’re working with finite data, that distribution would be the convolution of the sampling distribution of the correlation coefficient r (perhaps making a normal assumption) with the crud distribution. While you probably couldn’t compute this convolution in closed form, you could get a reasonable numerical approximation. The “p-value” is now a measure of how far your data’s correlation sits in the tail of this computed empirical crud distribution. If it’s more than two standard deviations from the mean crud, maybe you’re onto something.
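Here’s a minimal Monte Carlo sketch of what that test could look like. Everything in it is an assumption for illustration: the crud distribution is a made-up array of correlations, and I approximate the sampling noise of r with the usual Fisher z-transform rather than computing the convolution exactly.

```python
import numpy as np

def cruddy_null_test(r_observed, n, crud_samples, n_sims=100_000, rng=None):
    """Monte Carlo tail probability of the observed correlation under the
    crud null: rho is drawn from the crud pot, then a sample r of size n
    is drawn around it (Fisher z approximation for the sampling noise)."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: population correlations sampled from the crud distribution.
    rho = rng.choice(crud_samples, size=n_sims, replace=True)
    # Step 2: finite-sample noise in z-space, mapped back to correlations.
    z = rng.normal(np.arctanh(rho), 1.0 / np.sqrt(n - 3))
    r_null = np.tanh(z)
    # Step 3: how far into the tail is the observed correlation?
    return np.mean(np.abs(r_null) >= abs(r_observed))

# A made-up crud distribution, for illustration only.
rng = np.random.default_rng(0)
crud = np.clip(rng.normal(0.0, 0.25, size=5_000), -0.95, 0.95)
print(cruddy_null_test(r_observed=0.15, n=10_000, crud_samples=crud, rng=rng))
```

The returned tail probability plays the role of the p-value: the fraction of crud-generated correlations at least as large as the one you observed.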
Note that this sort of testing can’t cheat by growing n. In standard null hypothesis significance testing, a small correlation will be significant if n is large enough. But big n does not mean you’ll refute the cruddy null hypothesis. In fact, all that happens as n grows is that the “empirical” crud distribution converges to the “population” crud distribution. That is, the convolution no longer changes the distribution much. Once n is moderate, you are more or less testing whether your correlation is more than two standard deviations away from the mean of the crud distribution.
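To see this numerically, here’s a small variation on the sketch above (same made-up crud distribution): the 97.5th-percentile rejection cutoff under the crud null barely budges as n grows, while the usual nil-null cutoff of roughly 2/√n keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
crud = np.clip(rng.normal(0.0, 0.25, size=5_000), -0.95, 0.95)  # made-up crud

for n in [50, 500, 5_000, 50_000]:
    rho = rng.choice(crud, size=200_000, replace=True)
    r_null = np.tanh(rng.normal(np.arctanh(rho), 1.0 / np.sqrt(n - 3)))
    crud_cut = np.quantile(np.abs(r_null), 0.975)
    print(f"n={n:6d}  crud-null cutoff ~ {crud_cut:.3f}   nil-null cutoff ~ {2/np.sqrt(n):.3f}")
```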
Again, I don’t think this cruddy null testing solves everything, but it is definitely better than what we do now. We should know what a reasonably low bar for an effect size is. We should power our studies to refute that low bar. This doesn’t seem like an unreasonable request, does it?
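If the bar is a crud-level correlation rather than zero, the power calculation changes accordingly. Here’s a sketch under the Fisher z approximation, with hypothetical numbers (an effect of r = 0.4 against a crud bar of r = 0.2):

```python
import numpy as np
from scipy.stats import norm

def n_required(r_effect, r_bar, alpha=0.05, power=0.8):
    """Sample size needed to distinguish a correlation of r_effect from a
    crud-level bar of r_bar, using the Fisher z approximation."""
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    delta = np.arctanh(r_effect) - np.arctanh(r_bar)
    return int(np.ceil(((z_alpha + z_beta) / delta) ** 2 + 3))

print(n_required(0.4, 0.2))  # powered to beat a crud-level bar
print(n_required(0.4, 0.0))  # the usual calculation against a nil null
```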
What stops this from happening is that we don’t seem too enthusiastic about measuring these crud distributions carefully. What would that look like? Since the crud distribution is a distribution of correlation coefficients, we’d need to find a somewhat reasonable set of pairings of treatments and control variables specific to a field. We’d need reasonable datasets from which we could sample these pairings and compute the crud distribution. To me, this sounds like what Meehl and Lykken did in the 1960s: finding surveys with candidly answered questionnaires and tabulating correlations. In 2024, we have so many different tabulated spreadsheets we can download. I’m curious to see what crud we’d find.
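Here is one way that tabulation could go, sketched under the assumption that you have a tidy spreadsheet of candidly answered, numerically coded survey items. The file name and function are hypothetical.

```python
import itertools
import numpy as np
import pandas as pd

def crud_distribution(df, max_pairs=10_000, seed=0):
    """Pairwise Pearson correlations across numeric columns of a survey-style
    table: a rough empirical crud distribution for that dataset."""
    rng = np.random.default_rng(seed)
    cols = df.select_dtypes("number").columns
    pairs = list(itertools.combinations(cols, 2))
    if len(pairs) > max_pairs:
        idx = rng.choice(len(pairs), size=max_pairs, replace=False)
        pairs = [pairs[i] for i in idx]
    corrs = np.array([df[a].corr(df[b]) for a, b in pairs])
    return corrs[~np.isnan(corrs)]

# Hypothetical usage:
# df = pd.read_csv("big_survey.csv")
# crud = crud_distribution(df)
# print(np.quantile(np.abs(crud), [0.5, 0.9, 0.975]))
```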
For people who are familiar with his writing, I don’t think my suggestions here are any different from Jacob Cohen’s. In the 1960s, Cohen tried to formalize reasonable standardized effect size measures and use these to guide experiment design and analysis in psychology. One of Cohen’s more popular measures, Cohen’s d, is more or less equal to twice the correlation coefficient:
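With $r$ denoting the point-biserial correlation between group membership and outcome, and assuming two equal-sized groups,

$$
d = \frac{2r}{\sqrt{1 - r^2}} \approx 2r \quad \text{for small } r.
$$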
Cohen asked that people compute d and then evaluate the effect on a relative scale (small effects have d<0.2, large effects have d>0.8). One problem with Cohen’s proposal is that he assumed the scale for d was universal. But it certainly varies from field to field. It varies within fields as well, depending on the questions you’re asking. As I noted yesterday, in epidemiology we will always have Cohen’s d less than 0.2 for diseases like cancer. So to merge Meehl with Cohen, we’d need to look at the right distribution of effect sizes of random interactions and use this to set a relative scale for the believability of stories about correlations.
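As a toy illustration of what that relative scale might look like, here’s a sketch that scores an observed d by the fraction of pairings in a made-up, field-specific crud distribution that would produce a larger standardized effect. The conversion from r to d again assumes two equal-sized groups.

```python
import numpy as np

def crud_relative_effect(d_observed, crud_r):
    """Fraction of crud pairings whose implied Cohen's d exceeds the observed d."""
    crud_d = 2 * crud_r / np.sqrt(1 - crud_r**2)  # r -> d for equal-sized groups
    return np.mean(np.abs(crud_d) >= abs(d_observed))

# Made-up field-specific crud distribution of correlations.
rng = np.random.default_rng(1)
crud_r = np.clip(rng.normal(0.0, 0.25, size=5_000), -0.95, 0.95)
print(crud_relative_effect(0.2, crud_r))  # "small" on Cohen's universal scale
```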
After my dives into the history of machine learning, I’m not at all surprised that I’m rediscovering sensible advice from the 1960s. In fact, I wrote a book about why we keep reinventing ideas from the Cold War that will be out next year. (More on that later). My point today is that some ideas from the 1960s shouldn’t go out of style. Everyone pays lip service to Cohen, but then he gets ignored in practice. Cohen laments this disregard in the preface to the 1988 edition of his book. Perhaps this means that incremental changes aren’t the answer, and the system of mindless significance testing exists to maintain a powerful status quo. If that’s the case, maybe we need a revolution after all.
Wait! Didn’t we have a revolution? You know, a “credibility revolution?” Did that fix anything? Let me take on that question in the next post.