This post digs into Lecture 6 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.
I’ve been struggling all week to figure out how to blog about Meehl’s statistical obfuscators. In some sense, I’m guessing they are very obvious to you, dear reader. On one hand, studies should be large to make sure we aren’t rejecting good theories. On the other hand, in the social sciences, everything is correlated with everything, so if you make a study big enough, you’ll always reject the null hypothesis. These two contradict each other. You are damned if you do, and damned if you don’t. Meehl knows this. As he says at the beginning of Lecture 6, his ten obfuscators work in opposing directions to make the literature uninterpretable. Yet, this tension needs to be pulled apart because everyone is still arguing about the role of “power” in applied statistical work.
If you read enough critiques of papers, you might come away with the impression that an underpowered study is one where someone disagrees with the main conclusions. What does it mean to be underpowered? Why is that bad? It seems worth reviewing the practice of null hypothesis testing to try to set the stage for why power and crud combine to make the entire practice nonsensical.
In any experimental study, you want to ensure you have enough measurement power to detect what you are looking for. If you believe a star should be at some location in the sky, you need a telescope that can precisely point where you should look and a lens powerful enough to resolve the star.
In most hypothesis tests, we are trying to measure if the average value of some quantity is larger in one population than another. Is the average adult taller than the average child? Is the average number of deaths from a disease greater among patients who receive treatment A or treatment B? Does my website make more money on pages where there is a blue banner or a green banner? To measure this difference, we’re going to get N people from group one and N people from group two. If the difference between the means of the two groups is big enough, we’ll conclude there’s something going on there.
Power is the statistical analog of measurement precision. In a statistical study, precision is dictated by the number of samples. We first assume that every value we record is a random number. This assumption is useful because the more random numbers we average together, the more precise our estimate of their mean becomes.
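Here’s a tiny sketch of that fact (the distribution and sample sizes are my own arbitrary choices, nothing from the lecture): the spread of a sample mean shrinks like one over the square root of the number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average more random numbers and the estimate of their mean tightens:
# the spread of the sample mean shrinks like 1/sqrt(n).
for n in [10, 100, 1000, 10000]:
    sample_means = [rng.normal(size=n).mean() for _ in range(2000)]
    print(f"n={n:6d}   spread of sample mean = {np.std(sample_means):.4f}   1/sqrt(n) = {n**-0.5:.4f}")
```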
A power calculation estimates how many samples we need to observe to assure ourselves that the means are different. There are two main probabilities we use to calculate this number of observations. We imagine two different hypothetical scenarios. First, assume the intervention you care about does nothing. The size is the probability that we run our experiment and get a result that erroneously concludes the means are different. (It is the probability of rejecting the null hypothesis assuming the null hypothesis is true). Now assume the intervention we care about actually works. The power of a test is the probability that we get a result that correctly surmises that the intervention does something. (It is the probability of rejecting the null hypothesis assuming the intervention has a particular effect). The power will increase with the number of samples in the experiment, and a power calculation finds the number that gives us the desired power and size.
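Here’s a rough sketch of what such a calculation does under the usual normal approximation. None of this code comes from Meehl; the function name and the numbers are my own.

```python
from scipy.stats import norm

def approx_power(n_per_group, true_diff, sigma, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means,
    with n_per_group samples in each group and noise level sigma."""
    se = sigma * (2.0 / n_per_group) ** 0.5   # std. dev. of the difference in sample means
    z_crit = norm.ppf(1 - alpha / 2)          # about 1.96: the usual rejection threshold
    # Probability the observed difference clears the threshold when the true
    # difference is true_diff (ignoring the negligible wrong-direction tail).
    return norm.cdf(true_diff / se - z_crit)

# Power climbs toward 1 as the sample grows; a power calculation just inverts this.
for n in [25, 50, 100, 200]:
    print(n, round(approx_power(n, true_diff=0.5, sigma=1.0), 2))
```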
How does this play out in practice? Let D denote the measured difference between the means of the two groups. From our data, we can also estimate the standard deviation of the estimated difference. Call that S. Almost all null hypothesis tests work by rejecting the null hypothesis when D>2S.
Why we use the number 2 doesn’t matter. It supposedly corresponds to a probability of 5%, but these probabilities aren’t real. The threshold of 2 is just a very sticky convention that we all mindlessly accept. The important part is that S gets smaller with more samples. If our mean difference is truly not zero, we’ll find that D>2S if our sample is large enough.
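In code, the whole rule is a few lines. Here’s a minimal sketch with made-up numbers (the heights and sample sizes are mine):

```python
import numpy as np

def reject_null(group1, group2):
    """Reject when the gap between the sample means exceeds twice its standard error."""
    d = np.mean(group1) - np.mean(group2)                    # D: measured difference in means
    s = np.sqrt(np.var(group1, ddof=1) / len(group1)
                + np.var(group2, ddof=1) / len(group2))      # S: estimated std. dev. of D
    return abs(d) > 2 * s

rng = np.random.default_rng(1)
adults = rng.normal(170, 10, size=50)      # hypothetical heights in cm
children = rng.normal(120, 15, size=50)
print(reject_null(adults, children))       # a huge difference, so this prints True
```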
But how large is large enough? This is where the power comes in. We first come up with a number T, the minimal tolerable mean difference: if the true difference is smaller than T, we don’t practically care about it. This is our “risky prediction,” if you will. We want to make sure we have enough data that we’ll see D>2S in our sample if the true mean difference is T and the true standard deviation of the estimate is SE. So how small should SE be?
Statisticians returned to their chambers, donned their hoods, ran through their incantations, and decided that we need T > 2.8 SE. This will “guarantee” that the size is 5% and the power is 80%. To reiterate: to do a power calculation, we declare a tolerable difference in advance and compute the standard deviation of our estimate. That standard deviation will be some function of the number of samples. We choose the number of samples so that the tolerable difference is 2.8 times larger than the computed standard deviation. Amen.
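If you want to peek under the hood, the 2.8 is just the sum of two bell-curve quantiles, and the rest is solving T > 2.8 SE for the sample size. A sketch, in my own notation, assuming SE = sigma·sqrt(2/n) for two groups of n:

```python
from math import ceil
from scipy.stats import norm

alpha, power = 0.05, 0.80
magic = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # 1.96 + 0.84, about 2.8

def n_per_group(T, sigma):
    """Smallest n per group so that T > 2.8 * SE, where SE = sigma * sqrt(2/n)."""
    return ceil(2 * (magic * sigma / T) ** 2)

print(round(magic, 2))                 # 2.8
print(n_per_group(T=0.5, sigma=1.0))   # about 63 per group to detect a half-sigma difference
```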
Usually, these calculations are given in simple formulas for a person to read off, so they don’t have to think about what power calculations mean. Or even better, you can just run them in conveniently accessible software. You can just do your own personal incantation in your office, using R or STATA or some web calculator or whatever. Ask ChatGPT.
Let’s do one here! Though we’re supposed to be focusing on observational studies, let me use a hypothetical randomized trial as an illustration. Suppose we have a vaccine that we think prevents a disease. Our null hypothesis is that the vaccine does nothing. How large does the trial need to be to reject the null hypothesis when the treatment is effective? We can get a back-of-the-envelope calculation if we know some things in advance. First, we guess the prevalence: the percentage of people who will catch the disease without the vaccine. Then we assert a tolerable level of risk reduction for the vaccine. A 100% risk reduction means it always works. A 0% risk reduction means it does nothing. A 50% risk reduction means half the prevalence in those who receive the vaccine. And a 20% risk reduction means four-fifths of the prevalence in the vaccinated group. If we want to have a power of 80% and a size of 5%, the number of people we need to enroll in the trial is roughly
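$$
N \;\approx\; \frac{4\,(2.8)^2}{p\, r^2} \;\approx\; \frac{32}{p\, r^2},
$$

where p is the prevalence, r is the risk reduction, and N counts everyone in both arms. (That, at least, is the standard back-of-the-envelope version: treat the per-person variance in each arm as roughly p, so the standard error of the difference in disease rates is about the square root of 2p/n, and then apply the 2.8 rule from above.)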
We get a nice and tidy formula that we can compute. If the prevalence is one in a hundred and the tolerable risk reduction is 50%, then we’ll need to enroll about 13,000 people in the study. Yes, thirteen thousand.
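If you want to check the arithmetic, here’s that formula as a few lines of Python (my own sketch, not the output of any canned power routine):

```python
from math import ceil
from scipy.stats import norm

def trial_size(prevalence, risk_reduction, alpha=0.05, power=0.80):
    """Total enrollment, both arms combined, from the back-of-the-envelope formula above."""
    magic = norm.ppf(1 - alpha / 2) + norm.ppf(power)      # the 2.8 again
    return ceil(4 * magic**2 / (prevalence * risk_reduction**2))

print(trial_size(prevalence=0.01, risk_reduction=0.5))     # about 12,600, i.e. roughly 13,000
```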
Now what is this calculation guaranteeing? We’ve made a lot of assumptions about things we don’t know in advance (here, about prevalence). But if we believe all of the modeling, we’re going to run a 13,000-person study and have a 20% chance of having a perfectly good vaccine but being unable to reject the null hypothesis.
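Here’s that gamble as a simulation. This is my own sketch of the idealized two-group mean comparison sized by the 2.8 rule, not the vaccine trial itself (with binomial outcomes the exact numbers wobble a little), but the one-in-five failure rate shows up directly:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T = 1.0, 0.5     # noise level and the smallest difference we care about
n = 63                  # per group, from the 2.8 rule above

trials, misses = 20_000, 0
for _ in range(trials):
    a = rng.normal(T, sigma, size=n)   # the intervention really does shift the mean by T
    b = rng.normal(0.0, sigma, size=n)
    d = a.mean() - b.mean()
    s = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if not (d > 2 * s):                # a perfectly good effect that we fail to detect
        misses += 1

print(misses / trials)                 # hovers around 0.2
```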
Wait, 20% seems high to me! Hmm.
In sum, we have a bunch of confusing calculations that are hard to explain, a bunch of assumptions about experimental conditions that are hard to verify, and a one in five chance of having wasted our time after running a huge study. And this calculation was for a randomized trial with a concrete, manipulable intervention. As I’ll describe in the next post, the problem only becomes worse when we move to observational studies.
This is a meaty post that might benefit from expansion into multiples.
In some sense, I think there is a conflation between, on the one hand, fundamental epistemological problems that come from wanting things we can't have, and, on the other hand, "bad statistics" problems which are (somewhat) ameliorated by (for example) a more Bayesian approach?
I think the basic epistemological nightmare is that even in the best case, detecting small differences is hard and can only be done with some combination of enormous sample sizes and substantial probabilities of failure. For instance, the power calculation you've done is (abstractly) about detecting the difference between a coin that comes up heads one percent of the time and a coin that comes up heads half a percent of the time. It's not surprising you'd need thousands of samples to do that with a moderate chance of success!
If you're an HFT hedge fund and you get to take gazillions of identicalish bets, this is fine --- you're delighted with high-variance, positive-expected-value bets. But if you're trying to approve a vaccine or a drug or intervention, hoo boy. If the overall population effect is small (either because the intervention only does a little or because it only does it for a small fraction of the population), you're basically out of luck, even before you bring in all the problems related to how the real world doesn't meet your abstract assumptions.
But then on top of that, we add other dumb things, and those are unforced errors. For most non-physics things, "Are X and Y different?" is just the wrong question. The right question is "Are X and Y different by some amount T that matters substantively to a downstream decision maker?" I think Gelman gets this bit right when he says (roughly) "In social science, the null hypothesis is always false." And this connects back to decisions --- for social science, the "decision" is often whether the paper gets published or not, which messes with the incentives.
TL;DR: For social science questions I'm basically only interested in effects that are big enough to show up obviously in the plots. If you need statistics to tease it out I'm not interested?
I am looking forward to your next post in this series. I spent my academic life in one of the major and prestigious surgery departments in the USA. It was very RARE to encounter anybody who understood the sensible use, and the actual “nuts and bolts”, of hypothesis testing. In particular, I always have found, and still do find, it very difficult to perceive whether the use of hypothesis tests *with purely observational data* is “kosher” or not. There must be multiple concealed assumptions involved? Epidemiology (my “amateur” hobby) is filled with four zillion uses of these tests to detect whether some given association of two variables was or was not “significant” and then to “calibrate the strength of the association”. What in hell does that all mean? I think it means that folks doing this must be handling their data AS IF they were the fruits of a randomized trial (?!).