Neyman’s potential outcomes give us a powerful tool to measure the effects of interventions. The hard part is figuring out what exactly we are measuring.
In Neyman’s framework, we have a table of outcomes that we’d like to interrogate but can only reveal one entry per column. Yesterday, I described how randomized trials help us measure properties of this table. But there are still many questions lurking here that make these measurements a challenge.
The first question: how many patients should we enroll in order to measure the effect size? This is tricky even for simple binary outcomes. To get an a priori guess of the sample size required for your desired precision, you need to guess how frequently the outcome occurs. You need to declare what sort of relative risk reduction will be acceptable. You put these numbers together and get a “power calculation.”
Prevalence is how frequently the bad event occurs in the general population. Effectiveness is the number we use with vaccines, equal to one minus the relative risk. Precision is the number of standard deviations from the mean we are willing to tolerate. For the infamous 0.05 p-value threshold, the precision is 2.

Why do we never talk about how making the precision 5 drives the p-value to essentially zero? If we made trials six times larger, their results would be statistically unambiguous. Six-fold is not even an order of magnitude! People usually reply to such requests to increase sample sizes by lamenting how hard running trials is even when they are “powered” at the 0.05 level. But this is a cop-out. We have many medical interventions whose effects lie outside the 5-sigma window: vaccines like the Salk vaccine (z-score ≈ 7) and the COVID vaccines (z-score ≈ 20) are great examples. There are others: the leukemia drug Imatinib had a z-score of 17 on its primary outcome. Moreover, if it’s too hard to run trials powered at the 5-sigma level, then maybe trials are the wrong tool. If it is impossible to run trials with unambiguous measurements, maybe we should stop making excuses and admit that they shouldn’t be the “gold standard of evidence.”
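To make the arithmetic concrete, here is a rough sketch of the back-of-the-envelope calculation. The function name and the numbers in the example are mine, and the formula is the simplified two-proportion version in which precision is a single z multiplier; a real power calculation separates the significance and power terms.

```python
def trial_size_per_arm(prevalence, effectiveness, precision):
    """Back-of-the-envelope sample size for one arm of a two-arm trial.

    prevalence:    event rate in the control arm (p1)
    effectiveness: one minus the relative risk, so the treated-arm
                   rate is p2 = p1 * (1 - effectiveness)
    precision:     how many standard errors must separate the two arms
    """
    p1 = prevalence
    p2 = prevalence * (1.0 - effectiveness)
    # Standard two-proportion variance for the difference in rates.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return precision**2 * variance / (p1 - p2) ** 2

# A hypothetical 5% event rate and a 50% effective treatment:
n_2sigma = trial_size_per_arm(0.05, 0.5, 2.0)  # ~460 patients per arm
n_5sigma = trial_size_per_arm(0.05, 0.5, 5.0)  # ~2875 patients per arm
```

Since sample size scales with the square of precision, the 5-sigma trial needs (5/2)² = 6.25 times as many patients as the 2-sigma trial, no matter what prevalence and effectiveness you plug in. That is the six-fold factor.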
These statistical issues cloud far more important questions. What do the entries in the potential outcomes table mean? These outcomes are measurements themselves! Can we actually measure them unambiguously? In a medical trial, we aim for a clear outcome under a clear intervention. For example, the patient takes a pill, and then we see if they have a heart attack within five years. This mapping from treatment to outcome seems mostly uncontroversial to measure. But what if they die in an accident in year 1? What do we mark in our table? What if they have a heart attack on the first day of year 6? Heart attacks are one of the more unambiguous endpoints, and there are already hard questions to answer. As we move to more nebulous endpoints (e.g., progression-free survival in cancer), these issues only get worse.
This leads me to a third question: what does a randomized-trial measurement actually tell us about the treatment? The measurement is a difference in outcomes in one group of individuals. We then need some reason to expect that this measurement tells us something about other individuals. There is absolutely nothing in the design of a randomized controlled trial that guarantees this. The transportability of a trial’s results requires a leap of faith that we make before we even set up the experiment. We tell ourselves that whatever we do here informs us about what the intervention does in new people after the fact. But how do we know? We’re bringing other information with us about plausibility, repeatability, etc. This is why I am so bothered that conventional wisdom says randomized trials are about causation. We have already made up our minds about causation before the trial.
So how do we proceed? There are at least three answers. The first is to “do better science.” Obviously, we all agree with this one, but accept that it’s a challenge. Everyone always thinks they are doing their best (and perhaps thinks it’s everyone else who’s fucking up). I enjoyed Andrew Gelman’s blog post about this from a few weeks ago. He’s right on all points, but why don’t more people adhere to his (sort of obvious) suggestions?
Unfortunately, there is a far greater tendency to pursue the second answer: deploying “observational causal inference methods.” As everyone well versed in these methods knows, observational methods are nothing more than telling stories that pretend you ran a randomized trial in the first place. It seems ill-advised to think thought experiments can get us around the fundamental issues in experiments themselves.
The third answer is to become a bitter critic that everyone says nice things about but doesn’t listen to.
Maybe there’s a fourth answer? A path out of the planet of randomized trialomania?
Let me close the week with a question for you all reading this. Down which path should I take this blog next week? Let’s choose between 3 and 4. Should I write some bitter criticisms of observational methods? Or should I attempt to work through some of my rather half-baked ideas about the potential fourth path? Let me know in the comments or by email or on Twitter!
requesting door 4 !
I'm also in favour of 4.
I think Gelman has been very important in social science. He can be tactless, he can be Difficult, I think. But he's been vocal for years, criticizing bad science, criticizing noisy estimates with large standard errors and little theoretical reason to believe them. I think that has been helpful to social science.
But Gelman also does lots of work with observational data. He'll model, for example, vote as a function of state, income, education, gender etc. using the kind of logistic regression model you do not like. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9203002/
So that's something I was wondering: how do we think of this kind of work from the perspective you are developing in this blog?
From his essay on Causality and Statistical Learning (https://www.journals.uchicago.edu/doi/epdf/10.1086/662659):
"On one hand, I do not see how one can get scientifically strong causal inferences from observational data alone without strong theory; it seems to me a hopeless task to throw a data matrix at a computer program and hope to learn about causal structure (more on this below). On the other hand, I recognize that much of our everyday causal intuition, while not having the full quality of scientific reasoning, is still useful, and it seems a bit of a gap to simply use the label “descriptive” for all inference that is not supported by experiment or very strong theory. [...] For example, my own work demonstrates that income, religion, and religious attendance predict voter choice in different ways in different parts of the country; my colleagues and I have also found regional variation in attitudes on economic and social issues (Gelman et al. 2009). But I don’t know exactly what would be learned by throwing all these variables into a multivariate model."
I think it's a reasonable and measured perspective. But I think it opens the door to saying something like "religion causes vote choice." Or maybe not "cause" but "explain" or "predict"? It all sounds very metaphysical, but it's still something Gelman finds worth doing!