While every statistics course leads with how correlation does not imply causation, the methodological jump from observation to causal inference is small. Using the same algorithmic summarization and statistical analysis tools that we use to estimate averages, we can construct a reliable algorithm for evaluating the causal effect of interventions—actions that change the fate of individuals in a population. The crucial addition needed to evaluate cause and effect is the ability to intervene itself.

Let’s say that we have devised some intervention that can be applied to any individual in a population. We’d like to evaluate the impact of the intervention on the broader population by testing it on a small subset of individuals. As an example following on from the last post, the reader can think of height as the property we’d like to affect, and milk consumption as the treatment. We cannot apply the treatment to every individual or else we’d never be able to disentangle whether the treatment caused the outcome or not. The solution is to restrict our attention to a subset of the population, and leverage randomized assignment to eliminate confounding effects.

The simplest mathematical formulation of experiment design often referred to in terms of potential outcomes and was originally conceived by Jerzy Neyman (a character who will likely appear in every blog post in this series). Suppose if we apply a treatment to an individual, the quantity of interest takes value $A$. If we don’t apply the treatment, the quantity of interest is equal to $B$. Then we can define a quantity $Y$ which is equal to $A$ if the treatment is applied and equal to $B$ if the treatment is not applied. $Y$ is a deterministic quantity like height. However, there is an odd conditional effect: the value of $Y$ changes depending on whether we applied the treatment or not.

As a simple example, let $A$ denote the height of a person at age 18 if they drank milk growing up and $B$ denote their height if they did not drink milk. Now, obviously, one person can only take one of these paths! But we can imagine the two alternate realities where the same child either drank a cup of milk a day or drank a cup of water instead. The goal of an experimenter is to determine what would happen to a general individual had they taken either of the two paths in the road. The two potential outcomes here are the outcome if the treatment is applied and the outcome if the treatment is not applied.

While the potential outcomes formulation is tautological, it lets us apply the same ideas and statistics we used for computing the mean to the problem of estimating more complex treatment effects. For any individual, the treatment effect is a relation between the quantities A and B, commonly just the difference $A-B$. If the difference is positive, we see that applying the treatment increases the outcome variable for this individual. If a child drank a lot of milk, perhaps they are taller as an adult than if they only drank water. But, as we’ve discussed, our main hitch is that we can never simultaneously observe $A$ and $B$: once we choose whether to apply the treatment or not, we can only measure the corresponding treated or untreated condition.

This is where statistics can enter. Statistical algorithms can be applied to estimate average treatment effects across the general population. We can examine trends in small groups of individuals and extrapolate the insights to the broader population.

For such extrapolation, there are a variety of conventions for defining population level treatment effects. For example, we can define the average treatment effect to be the difference between the mean of $A$ and the mean of $B$. For those more comfortable seeing this written out as a formula, we can write

$\small{ \text{Average Treatment Effect} = \text{mean}(A)-\text{mean}(B)}$

In our milk example, this would be the difference in the mean of the population height if everyone drank milk versus no one drank milk.

Other population level quantities of interest arise when $A$ and $B$ represent binary outcomes. This could be, say, whether a person is over six feet tall as an adult. Or, for a more salient contemporary example, this could be whether a patient catches a disease or not in a vaccine study. In this case, $A$ is whether the patient catches the disease after receiving a vaccine and $B$ is whether the patient catches the disease after receiving a placebo.

The odds that an individual catches the disease is the number of people who catch the disease divided by the number who do not. The odds ratio for a treatment is the odds when every person receives the vaccine divided by the odds when no one receives the vaccine. We can write this out as a formula in terms of our quantities $A$ and $B$: when $A$ and $B$ can only take values 0 or 1, $\text{mean}(A)$ is the number of individuals for which $A=1$ divided by the total number of individuals. Hence, we can write the odds ratio as

$\small{ \text{Odds Ratio} = \frac{\text{mean}(A)}{1-\text{mean}(A)} \cdot \frac{ 1-\text{mean}(B)}{\text{mean}(B)}}$

This measures the decrease (or increase!) of the odds of a bad event happening when the treatment is applied. When the odds ratio is less than 1, the odds of a bad event are lower if the treatment is applied. When the odds ratio is greater than 1, the odds of a bad event are higher if the treatment is applied.

Similarly, the risk that an individual catches the disease is the ratio of the number of people who catch the disease to the total population size. Risk and odds are similar quantities, but some disciplines prefer one to the other by convention. The risk ratio is the fraction of bad events when a treatment is applied divided by the fraction of bad events when not applied. Again, in a formula,

$\small{ \text{Risk Ratio} = \frac{\text{mean}(A)}{\text{mean}(B)}}$

The risk ratio measures the increase or decrease of relative risk of a bad event when the treatment is applied. In the recent context of vaccines, this ratio is popularly reported differently. The effectiveness of a treatment is one minus the risk ratio.

This is precisely the number used when people say a vaccine is 95% effective. It is equivalent to saying that the proportion of those treated who fell ill was 20 times less than the proportion of those not treated who fell ill. Importantly, it does not mean that one has a 5% chance of contracting the disease.

Randomized algorithms give us cut and dry techniques to construct high accuracy estimators of population-level effects. Specifically, we can frame the estimation of the various measures of treatment effects as a particular statistical sampling strategy.

Think of the potential outcomes framework as doubling the size of the population. Each individual has an outcome under treatment and not under treatment. Hence, if we randomly select a sample and then randomly assign a treatment to each individual of the sample, we can compute the mean values of all subjects assigned to the treatment and all patients assigned to control. As long as $A$ and $B$ are bounded, such means of our samples are reasonable estimates of all of the treatment effects provided the number of samples is large enough.

The two stage process of building a sample and then randomizing assignment is equivalent to computing a random sample of the potential outcomes population. Randomized assignment allows us to probe population level effects without observing both outcomes of each individual. It’s an algorithmic strategy to extract information: Experiments are algorithms.

Just like in the last post, this sampling method does not assume anything about the randomness of A and B. This randomized experiment design assumes that we can select samples at random from the population and assign treatments at random. But the individual treatment effects can be either deterministic or random. We do not need a probabilistic view of the universe in order to take advantage of the power of randomized experiments and prediction. Statistics still serve as a way to reason about the proportion of beneficial and adverse effects of interventions.

Moreover, elementary statistics allows us to quantify how confident we should be in the point estimate generated by our experiment with very little knowledge about the processes behind A and B. If the outcomes are binary, we can compute exact confidence intervals for our outcomes, regardless of how they are distributed. For example, returning to my favorite example of the Pfizer vaccine trial, the confidence intervals used were the Clopper-Pearson intervals, which are directly derived from the binomial distribution. The effect size was so large that simple statistics revealed an impressively large effect.

There are certainly limitations with the randomized experiment paradigm. Reliability estimating treatment effects needs the variability of the outcomes to be low, average effects may hide important variation across the population, and temporal dynamics and feedback effects can impact causal conclusions. In future posts, I’ll dive into these critiques in more detail.

Despite the potential limitations, it’s remarkable how causal effects can be measured with some rudimentary sampling and statistics. The same ideas used to estimate a mean can immediately be applied to estimate average effects of interventions. In both cases, we needed only modest knowledge of the effects under study to design algorithmic measures and to establish confidence intervals on their outcomes. In the next post, we’ll explore whether a similar program can be applied to the art of prediction and machine learning. (Spoiler alert: it’s complicated!)

Finally, I have to discuss an elephant I’ve left in the room. Determining cause and effect becomes impossibly challenging once we can’t intervene. For example, suppose we are trying to understand the effectiveness of a vaccine outside a well controlled clinical trial. In the wild, we have no control over who takes the vaccine, but instead can sample from a general population where a vaccine is available and count the number of people who got sick. Determining cause and effect from such observational data requires more modeling, knowledge, and statistical machinery. And no matter how sophisticated the analysis, arguments about hidden confounding variables and other counterfactuals are inevitable. Whenever naysayers are yelling how correlation doesn’t imply causation, it’s always targeting an observational study rather than a randomized controlled trial. For a comprehensive introduction to the deep complexity of this topic, let me shamelessly plug the causality chapter in Patterns, Predictions, and Actions, which features both my favorite introduction to observational causal inference penned by Moritz and a version of this blog on experiments.