The Impact of Actions
Introducing the randomized experiment as a measurement device for policy optimization.
The last three lectures have focused on optimal decision-making from data. But our model assumed we knew the value of each action. In the next few lectures, we’ll look at how to estimate the value of actions themselves.
Let’s say we have a single action with an unknown reward. We’d like to compare this to the reward of inaction. Hey, this is a machine learning class, so we could try to machine learn the mapping from action to reward, right? We could collect a bunch of data where we do the action and collect a bunch of data where we don’t take the action. Since we’ve been belaboring the point of having good internal validity, we could even choose to act on a random set of half the data, call it A, and not act on the other half, which we call B. Then we could fit a model using empirical risk minimization to predict reward. If we use linear regression on the simple model where the only feature is the action, we’ll find that the coefficient of the action is the expression

$$\frac{1}{|A|}\sum_{i \in A} y_i \;-\; \frac{1}{|B|}\sum_{i \in B} y_i\,,$$

where $y_i$ is the reward observed for individual $i$.
This prediction is the difference between the average reward in group A (which we might call the treatment group) and the average reward in group B (which we’ll call the control group). Lo and behold, we have used machine learning to reverse engineer the randomized controlled experiment.
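To make this concrete, here is a minimal sketch, with made-up numbers rather than anything from the lecture, showing that least squares on an intercept plus a single binary action feature recovers exactly the difference of group means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Act on a random half of the data (group A), leave the rest alone (group B).
n = 1000
action = rng.permutation(np.repeat([1, 0], n // 2))
reward = 1.5 * action + rng.normal(size=n)   # hypothetical reward model

# Empirical risk minimization for linear regression: least squares on [1, action].
X = np.column_stack([np.ones(n), action])
coef, *_ = np.linalg.lstsq(X, reward, rcond=None)

# The action coefficient equals the difference of the two group means.
diff_in_means = reward[action == 1].mean() - reward[action == 0].mean()
print(coef[1], diff_in_means)
```

The two printed numbers agree: the fitted coefficient is exactly the treatment-minus-control average.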
Am I just totally taking the piss this morning? I’ve spent the semester belaboring that machine learning is what you do when you don’t have good models for some predictive function, replacing the ability to write simple code for a prediction with data that makes the prediction for you. The randomized experiment is supposed to be the pinnacle of rigor in scientific practice and the gold standard of causal inference. Surely, I shouldn’t cast it in with the acausal predictive dregs of machine learning!
The randomized experiment and prediction have more in common than you’d think, and I don’t mean this in a disparaging way. Randomized experiments are one of the most valuable tools for information gathering in the human-facing sciences. But they are only needed when we lack fundamental understanding.
Why do we do randomized experiments? I like to use medicine as a motivating example, as this is one of the main disciplines where the randomized experiment has proven revolutionary. The other is A/B testing for engagement on the internet, a decidedly less inspiring use case. Let’s go back to the beginning of the randomized trial to understand the kinds of interventions for which randomized trials are valuable.
I have blogged before about the initial studies of penicillin. A few case studies were enough to convince doctors that this treatment was revolutionary. It was pretty clear that people who would have died without treatment were saved by penicillin. Infections could be observed visibly receding, sometimes within a few hours.
The major penicillin studies were in the early 1940s. Only a few years later, in 1947, doctors in the UK performed the first randomized controlled experiment in medicine, evaluating the efficacy of the antibiotic streptomycin for treating tuberculosis. Why do a trial when penicillin’s effectiveness had just been established by case study? It’s because streptomycin is not highly effective at curing TB. In the trial, 4 of the 55 patients given streptomycin died. In the control group, treated by bed rest alone, 15 of 52 patients died. This is a large measured treatment effect, but how precisely can we nail down the effect size from this small group of patients? Using a standard statistical estimate, the 95% confidence interval pins the effect at between a 7% and a 36% reduction in mortality. A 7% reduction in deaths is definitely significant, but it is much less striking than the effect of penicillin. Streptomycin was the first step in a very long journey to find reliable cures for TB. More than seventy-five years later, we know that streptomycin is only a marginally effective treatment for TB. The standard of care today is a 12-month course of four different antibiotics.
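For the curious, here is a rough sketch of where an interval like that comes from: a standard Wald-style 95% confidence interval for a difference of proportions, computed from the trial’s counts. The trial’s actual analysis may have used a different estimator; this is just a back-of-the-envelope check.

```python
import math

d_t, n_t = 4, 55    # deaths / patients, streptomycin group
d_c, n_c = 15, 52   # deaths / patients, bed-rest control group

p_t, p_c = d_t / n_t, d_c / n_c      # ~7% vs ~29% mortality
diff = p_c - p_t                     # reduction in mortality, ~22 percentage points

# Wald standard error for a difference of two proportions.
se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"reduction: {diff:.0%}, 95% CI: ({lo:.0%}, {hi:.0%})")   # roughly 7% to 36%
```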
A. Bradford Hill, one of the architects of the streptomycin trial, would emphasize that effect size estimation was only one of the many values of randomization in experiments. Randomization, in his mind, principally served to remove biases. Before the streptomycin trial, patients would be assigned to treatment and control in alternating order, and clever physicians could figure out whether they were giving a patient the treatment or a placebo. Randomization helps blind both the patient and the doctor to which treatment is being applied. More generally, randomization ensures that the only reason an individual is assigned to treatment or control is random chance, removing potential biases that can be introduced by the experimenter. Randomization provides a means to make the treatment and control groups as similar as possible, so that the only potential difference between the groups is the effect of the action itself. That we can also back out a precise measurement of the size of the effect is an added bonus.
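As a toy illustration of that balancing property, here is a small simulation (entirely hypothetical, not from the trial) showing that a pre-treatment covariate ends up with nearly identical averages in the two randomly assigned groups, so it cannot masquerade as a treatment effect:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
age = rng.normal(50, 10, size=n)                      # a hypothetical pre-treatment covariate
assignment = rng.permutation(np.repeat([1, 0], n // 2))  # random half to treatment, half to control

# With random assignment, the covariate is balanced across groups on average.
gap = age[assignment == 1].mean() - age[assignment == 0].mean()
print(f"difference in mean age between groups: {gap:.3f}")   # close to zero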
And now, I again break out the c-word. When we do an experiment, we expect from the get-go that our action does something. We set up a randomized experiment to measure the difference between inaction and action. But people love to make an arrogant leap from measurement to metaphysics and peg randomized experiments as enabling causal inference. If we observe a difference between action and inaction, and if we are very careful to ensure the only difference between our treatment group and control group is this action itself, we can assert that the action causes the observed difference.
But cause is an incredibly loaded word. And I worry it mostly leads to trouble and confusion. We knew in advance that our intervention was the cause at play here. We set the experiment up ourselves. We just wanted to measure the size of its effect. Thinking that the isolated tinkering of randomized experiments can solve all of epistemology is foolish scientific arrogance. I’ll come back to why this bugs me next week.
For now, let me not take away any more from the randomized experiment. Hands down, randomized experiments are the most important contribution of statistics. They are an essential device for measuring the impacts of actions. They can’t solve all of our scientific problems, but they remain one of the most powerful tools in our empirical tool belt.