One hundred years of experimentude
Revisiting Neyman's seminal paper on the design of experiments
2023 marks the 100th anniversary of the scientific method. At least, that’s what those of us with statistics on the brain have etched into our subconscious. 100 years ago, Jerzy Neyman published “On the Application of Probability Theory to Agricultural Experiments” in a Polish journal called the Annals of Agricultural Sciences. The world would never be the same.
I highly recommend you give this paper a read. The 1990 translation by Dorota Dabrowska and Terry Speed is very accessible to a modern stats-minded person. And for whatever it’s worth, it’s markedly clearer and simpler than Fisher’s 1935 Design of Experiments, which would lay out a parallel framework for experimentation a decade later. Thankfully, there are no tea ladies in Neyman’s manuscript.
Neyman poses the experimental model that we now call potential outcomes. He describes a randomized algorithm to measure properties of the world that are subject to decisions. The key insights in this paper are that randomized controlled experiments are a form of measurement and that their treatment effects can be estimated by combining combinatorics and probability. These insights are now the backbone of social and biomedical scientific practice.
Neyman focuses his examples on “field experiments,” by which he literally means experiments on crops done in fields. But my personal interests steer me to describe Neyman’s idea in terms of randomized clinical trials.
When we run a clinical trial, our goal is to measure the efficacy of some treatment. What does this mean? In potential outcomes, we envision the following model. We first posit an outcome we’d hope to avoid. For example, heart attacks within five years of entering the trial. For each patient in the trial there are two hypothetical paths: if they get the treatment, there will be some outcome. If they don’t get the treatment, there will be a potentially different outcome. We can assemble all of the potential outcomes into a table like this:
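|                | A | B | C | D | E | F | G | H |
|----------------|---|---|---|---|---|---|---|---|
| With treatment | N | N | Y | Y | N | Y | N | N |
| No treatment   | Y | Y | Y | Y | N | Y | Y | Y |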
Here Y denotes the outcome occurring, and N denotes the outcome not occurring. If we knew this table in advance, we’d be able to decide whether or not to prescribe the treatment. Patient E will never have a bad event, so there’s no need to treat this person. Patients C, D, and F will have a bad outcome no matter what we do. And we could give medicine to A, B, G, and H and prevent their bad outcomes from occurring.
Of course, within the confines of reality, we are not telepathic. We can only observe one of these outcomes per column. We cannot observe what happens to a particular individual both if they get the treatment and if they don’t. So it’s impossible to know in advance whether a treatment will help an individual. But Neyman proposed a method to measure whether a treatment is helpful on average. From the table, we see that bad outcomes happen ⅜ of the time when these patients get the treatment and ⅞ of the time when the patients don’t get the treatment. We say that the absolute risk reduction of the treatment, equal to the difference between the rates of bad outcomes, is ⅞ − ⅜ = ½.
Neyman realized that these rates and the risk reduction itself could be measured with a randomized algorithm. We could sample a group of patients without replacement, say (A, B, D, H), and give them the treatment. The remaining patients, here (C, E, F, G), would receive the standard of care.
We could then observe the outcomes under these sampled assignments:
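|                | A | B | C | D | E | F | G | H |
|----------------|---|---|---|---|---|---|---|---|
| With treatment | N | N | ? | Y | ? | ? | ? | N |
| No treatment   | ? | ? | Y | ? | N | Y | Y | ? |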
The tricky part of a clinical trial is that we cannot observe the effect of the treatment on any individual. A patient can only get the treatment or the standard of care, not both. Hence, half of the table of outcomes is populated by question marks, denoting outcomes that are not knowable.
But we can still estimate the rate of bad outcomes in each group from the mean of what we did observe. Here, the bad-outcome rate in the treatment group is ¼. Similarly, we estimate the rate of bad outcomes under standard of care as ¾. We can then estimate the absolute risk reduction by taking the difference of these two numbers, suggesting an absolute risk reduction of… ½.
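To make this concrete, here’s a minimal Python sketch of the procedure, using the hypothetical potential-outcomes table from above (the patient labels and the `estimate_arr` helper are mine, purely for illustration). Averaging the estimate over many random assignments recovers the true absolute risk reduction of ½, which is Neyman’s unbiasedness argument in miniature:

```python
import random

# Hypothetical potential-outcomes table from above: for each patient,
# (outcome if treated, outcome if untreated), with 1 = bad event, 0 = none.
potential_outcomes = {
    "A": (0, 1), "B": (0, 1), "C": (1, 1), "D": (1, 1),
    "E": (0, 0), "F": (1, 1), "G": (0, 1), "H": (0, 1),
}

def estimate_arr(table, n_treated=4, rng=random):
    """One run of Neyman's randomized measurement: sample a treatment group
    without replacement, observe one potential outcome per patient, and
    return the difference of the observed bad-outcome rates."""
    patients = list(table)
    treated = set(rng.sample(patients, n_treated))
    treated_rate = sum(table[p][0] for p in treated) / n_treated
    control = [p for p in patients if p not in treated]
    control_rate = sum(table[p][1] for p in control) / len(control)
    return control_rate - treated_rate

random.seed(0)
# Averaged over many random assignments, the estimate approaches the
# true absolute risk reduction of 1/2.
print(sum(estimate_arr(potential_outcomes) for _ in range(10_000)) / 10_000)
```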
This procedure is a randomized measurement of the absolute risk reduction. It’s precisely what is reported in every trial in the New England Journal of Medicine today. It’s also what you report in the A/B tests you run at your tech company. Neyman was even more prescient about why this is such a powerful measurement: for a variety of different outcomes, you can compute the standard deviation of the measurement process.
In this case, the simplest variance estimate comes from estimating the errors in measuring the outcome rate in each group individually. As Neyman puts it, this process is identical to sampling without replacement from urns with two colors of balls. That is, the distribution is hypergeometric. If we had enough patients, we could pretend each rate estimate was Gaussian, compute the variance using a normal approximation, and then find 2-sigma confidence intervals. This is more or less what you get when you run a proportions z-test. A more rigorous confidence interval is given by Li and Ding (in 2015!!!). But it’s tough to construct potential outcome tables where the rigorous intervals look markedly different from Neyman’s naive normal approximation of the variance from 1923.
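As a rough sketch of that back-of-the-envelope calculation (assuming the usual unpooled p(1 − p)/n variance estimate for each arm; the function name is mine):

```python
import math

def neyman_interval(bad_treated, n_treated, bad_control, n_control, z=2.0):
    """Normal-approximation (roughly 2-sigma) confidence interval for the
    absolute risk reduction, using the unpooled variance estimate
    p*(1-p)/n for each arm, as in a proportions z-test."""
    p_t = bad_treated / n_treated
    p_c = bad_control / n_control
    arr = p_c - p_t
    se = math.sqrt(p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control)
    return arr - z * se, arr + z * se

# The toy trial above: 1 of 4 bad outcomes under treatment, 3 of 4 under
# standard of care. With only 8 patients the normal approximation is not
# to be trusted; this is only to show the mechanics.
print(neyman_interval(1, 4, 3, 4))
```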
Neyman’s paper has no p-values. For Neyman, randomized experiments were measurements. Statistics could be used to quantify the measurement error. And what to do with the resulting noisy measurement was left to the experimenter’s discretion. We might have been better off if we had just stopped there.
Another tangent, but this reminded me that a while back there was a big Twitter spat between Judea Pearl and the “trialists”. A misconception I realized I had, one that perhaps should have been obvious to me, is that in an RCT the treatment assignment is randomized, but the sample of participants is (almost always) not. So, when interpreting the results of a trial, the estimates of, e.g., the ARR are specific to the sample of participants in that trial. Given what little I know about trial enrollment, it doesn’t seem like we should have much confidence in the generalizability of such results to a new population (I would be happy to be wrong here!).
Further, RCT statisticians have developed their own language of complicated tools (insert your favorite combination of “cluster”, “block”, and “crossover” before “trial design”) for designing experiments. It seems that the magic, the real intellectual achievement here, is in being able to estimate the average treatment effect in a sample despite the missing counterfactual outcomes. And while this is certainly impressive, it doesn’t seem to help one answer questions like “will this treatment work on my patient? Will this treatment work on me?” How does generalizability or transportability of effect estimates come into play here? This, to me, seems like a very important piece of the problem.