Power, Corruption, and Lies
What are we measuring with randomized controlled trials?
Neyman’s potential outcomes give us a powerful tool to measure the effects of interventions. The hard part is figuring out what exactly we are measuring.
In Neyman’s framework, we have a table of outcomes that we’d like to interrogate but can only reveal one entry per column. Yesterday, I described how randomized trials help us measure properties of this table. But there are still many questions lurking here that make these measurements a challenge.
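To make the one-entry-per-column constraint concrete, here is a minimal simulation of such a table. The event rates are made up purely for illustration: we posit both potential outcomes for every patient, let a coin flip reveal only one of them, and compare the resulting difference-in-means estimate against the average treatment effect we could compute if we could see the whole table.

```python
import random

random.seed(0)

# Hypothetical potential-outcomes table: for each patient we posit two
# entries, the outcome if treated and the outcome if untreated
# (1 = bad event, 0 = no event). Event rates here are invented.
n = 10_000
table = [(1 if random.random() < 0.05 else 0,   # outcome if treated
          1 if random.random() < 0.10 else 0)   # outcome if untreated
         for _ in range(n)]

# The average treatment effect over the whole table — unobservable in
# practice, since we never see both entries for the same patient.
true_ate = sum(t - c for t, c in table) / n

# A randomized trial: flip a fair coin for each patient, reveal only
# the corresponding entry, and compare group means.
assignment = [random.random() < 0.5 for _ in range(n)]
treated = [t for (t, _), a in zip(table, assignment) if a]
control = [c for (_, c), a in zip(table, assignment) if not a]
estimate = sum(treated) / len(treated) - sum(control) / len(control)

print(f"true ATE: {true_ate:.3f}, trial estimate: {estimate:.3f}")
```

Randomization is doing all the work here: because the coin flips are independent of the table entries, the group means are unbiased estimates of the column averages we can never fully observe.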
The first question: how many patients should we enroll in order to measure the effect size? This is tricky even for simple binary outcomes. To get an a priori estimate of the required sample size, you need to guess how frequently the outcome occurs and declare what sort of relative risk reduction would be acceptable. You put these numbers together and get a “power calculation.”
Prevalence is how frequently the bad event occurs in the general population. Effectiveness is the number we use with vaccines, equal to one minus the relative risk (that is, the relative risk reduction itself). Precision is the number of standard deviations from the mean we are willing to tolerate. For the infamous 0.05 p-value threshold, the precision is 2.

Why do we never talk about how making the precision 5 makes the p-value essentially zero (below one in a million)? Since the required sample size scales with the square of the precision, going from 2 to 5 means making trials about six times larger ((5/2)² ≈ 6.25), and then their results would be “statistically” unambiguous. Six-fold is not even an order of magnitude! People usually reply to such requests to increase their sample size by lamenting how hard running trials is even when they are “powered” for the 0.05 level. But this is a cop-out. We have many medical interventions whose effects lie outside the 5-sigma window: vaccines like the Salk vaccine (z-score ~ 7) and the covid vaccine (z-score ~ 20) are great examples. And there are others: the leukemia drug Imatinib had a z-score of 17 in its primary outcome. Moreover, if it’s too hard to run trials powered at the 5-sigma level, then maybe trials are the wrong tool. If it is impossible to run trials with unambiguous measurements, maybe we should stop making excuses and admit that they shouldn’t be the “gold standard of evidence.”
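The back-of-the-envelope version of this power calculation fits in a few lines. The sketch below assumes a rare binary outcome and equal-sized arms: with a control event rate of `prevalence` and a treated rate of `prevalence * (1 - effectiveness)`, the per-arm count needed to resolve the difference at `precision` standard deviations works out to roughly 2 · precision² / (prevalence · effectiveness²). The function name and example numbers are mine, and this is a rough approximation, not a substitute for a real power analysis.

```python
def sample_size_per_arm(prevalence: float, effectiveness: float,
                        precision: float) -> int:
    # Rare-event approximation: the variance of each arm's event rate is
    # about prevalence / n, the effect to detect is prevalence * effectiveness,
    # so n must satisfy precision * sqrt(2 * prevalence / n) < prevalence * effectiveness.
    return round(2 * precision**2 / (prevalence * effectiveness**2))

# e.g., a 1% event rate and a 50% effective treatment:
n_2sigma = sample_size_per_arm(0.01, 0.5, 2)  # the usual 0.05-level trial
n_5sigma = sample_size_per_arm(0.01, 0.5, 5)  # the "unambiguous" trial

print(n_2sigma, n_5sigma, n_5sigma / n_2sigma)
```

The ratio of the two sample sizes is exactly (5/2)² = 6.25, which is where the six-fold figure comes from: the prevalence and effectiveness cancel, so the cost of moving from 2 sigma to 5 sigma is the same multiplier for any trial of this shape.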
These statistical issues cloud much more important questions. What do the entries in the potential outcomes table mean? These outcomes are measurements themselves! Can we actually measure them unambiguously? In a medical trial, we aim for a clear outcome under a clear intervention. For example, the patient takes a pill, and then we see if they have a heart attack within five years. This mapping from treatment to outcome seems mostly uncontroversial to measure. But what if they die in an accident in year 1? What do we mark in our table? What if they have a heart attack on the first day of year 6? Heart attacks are among the more unambiguous endpoints, and there are already hard questions to answer. As we move to more nebulous outcomes (e.g., progression-free cancer survival), these issues only get worse.
This leads me to a third question: what does a randomized-trial measurement actually tell us about the treatment? The measurement is a difference in outcomes in one group of individuals. We then need some reason to expect that this measurement tells us something about other individuals. There is absolutely nothing in the design of a randomized controlled trial that guarantees this. The transportability of a trial’s results requires a leap of faith that we make before we even set up the experiment. We tell ourselves that whatever we do here informs us about what the intervention does in new people after the fact. But how do we know? We’re bringing other information with us about plausibility, repeatability, etc. This is why I am so bothered that conventional wisdom says randomized trials are about causation. We have already made up our minds about causation before the trial.
So how do we proceed? There are at least three answers. The first is to “do better science.” Obviously, we all agree with this one, but it’s a challenge. Everyone always thinks they are doing their best (and perhaps thinks that it’s everyone else who’s fucking up). I enjoyed Andrew Gelman’s blog about this from a few weeks ago. He’s right on all points, but why don’t more people adhere to his (sort of obvious) suggestions?
Unfortunately, there is far more tendency to pursue the second answer: deploying “observational causal inference methods.” As everyone well versed in these methods knows, observational methods are nothing more than telling stories pretending you did a randomized trial in the first place. It seems ill-advised to think thought experiments are helping us around the fundamental issues in experiments themselves.
The third answer is to become a bitter critic that everyone says nice things about but doesn’t listen to.
Maybe there’s a fourth answer? A path out of the planet of randomized trialomania?
Let me close the week with a question for all of you reading this. Down which path should I take this blog next week? Let’s choose between 3 and 4. Should I write some bitter criticisms of observational methods? Or should I attempt to work through some of my rather half-baked ideas about the potential fourth path? Let me know in the comments, by email, or on Twitter!
Thanks for reading arg min substack! Subscribe to find out which fork your faithful blogger chooses.