Staging Interventions
Actions are fundamentally different from predictions, but it's hard to write this distinction down in math.
This is a live blog of Lecture 18 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
I’m surprised I didn’t trigger any of my statistical interlocutors on Tuesday with my post about randomized trials. Was I trolling? Maybe a little. I was hoping I’d get scolded about confounding. Because while it feels like you can predict the outcome of actions just like any other pattern, there really is a fundamental wrinkle introduced when we bring actions into the machine learning story.
Our base assumption in machine learning prediction is that future data will look like past data. That is, machine learning models prediction as a fundamentally passive act. Predictions don’t change processes.
But if you implement a new policy based on past data, your future data will be decidedly different. I’m not a big fan of the term, but we can say that actions induce a “distribution shift.” Once you accept that acting changes the probabilities, the floor sort of falls out from under all of our beautiful mathematical decision theory. I’m sure someone figured this out before Lucas did, but this observation is commonly called the Lucas critique of statistical modeling.
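To make that floor-falling-out a little more concrete, here is a minimal sketch of the kind of shift I mean. The setup, numbers, and variable names are all invented for illustration: in retrospective data where a heater gets set haphazardly, heater output strongly predicts indoor temperature; once a controller sets the heater based on the outdoor temperature, that association disappears, even though the heater is doing all the work.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Outdoor temperature varies; indoor temperature = outdoor + heater + noise.
outdoor = rng.normal(loc=5.0, scale=8.0, size=n)

# Passive regime: the heater is set haphazardly, independent of the weather.
heater_passive = rng.normal(loc=15.0, scale=5.0, size=n)
indoor_passive = outdoor + heater_passive + rng.normal(scale=0.5, size=n)

# Active regime: a controller sets the heater to hit a 20 degree target.
# Same physics, but the policy changes the joint distribution of the data.
heater_active = 20.0 - outdoor
indoor_active = outdoor + heater_active + rng.normal(scale=0.5, size=n)

print("corr(heater, indoor), passive:",
      round(np.corrcoef(heater_passive, indoor_passive)[0, 1], 2))
print("corr(heater, indoor), active: ",
      round(np.corrcoef(heater_active, indoor_active)[0, 1], 2))
```

A model fit to the passive data correctly learns that cranking the heater warms the room; a model fit to the active data sees essentially no association at all. Same system, different policy, different statistics.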
So this lecture is a pause before we get into reinforcement learning, where these problems are even worse. Policies change the future. You have to account for these changes when using predictive modeling. Using retrospective data where you didn’t explicitly intervene is always going to be problematic, not just because of confounding, but because the data-generating process changes once you start acting with intent. As we move from predictions to actions, we move from passive observation to active meddling. This means we need to figure out the impact of feedback.
Though not typically presented this way, feedback is the lens through which I view biases in retrospective analyses. If we have a complex system in which we observe two variables and hope that changing one changes the other, we need stronger evidence than simple covariation. If you have a retrospective list of actions and outcomes, you can build a prediction of an outcome under treatment and an outcome under no treatment. But we’re usually not happy with this. We say, “That’s just association.”
To use a famous example, you can look at past data and conclude that red wine lowers the incidence of heart disease. But if everyone started drinking red wine, I personally doubt the incidence would drop further. This is an example of confounding, in which unmodeled factors influence both the treatment (red wine consumption) and the outcome (heart disease). For example, socioeconomic status might raise the likelihood of drinking fine wine and lower the incidence of heart disease because of access to better preventative care.
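Here is an equally minimal confounding sketch. The model and numbers are made up, and by construction wine has exactly zero causal effect: a latent socioeconomic variable drives both wine drinking and heart health, so the retrospective comparison shows a “benefit” that randomized assignment does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical data-generating process. Socioeconomic status (the unobserved
# confounder) makes red wine more likely AND heart disease less likely.
# Red wine itself does nothing: the true causal effect is zero.
ses = rng.normal(size=n)
wine_obs = rng.random(n) < sigmoid(ses)          # retrospective "treatment"
disease = rng.random(n) < sigmoid(-1.0 - ses)    # outcome, unaffected by wine

# Naive retrospective estimate: compare disease rates by observed wine drinking.
naive = disease[wine_obs].mean() - disease[~wine_obs].mean()

# Randomized assignment severs the link between SES and the treatment.
wine_rct = rng.random(n) < 0.5
randomized = disease[wine_rct].mean() - disease[~wine_rct].mean()

print(f"naive observational 'effect': {naive:+.3f}")      # noticeably negative
print(f"randomized estimate:          {randomized:+.3f}")  # close to zero
```

The point isn’t the particular numbers. The retrospective comparison answers “what happens to people who happen to drink wine,” not “what happens if we make people drink wine.”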
You could teach an entire course on the biases that can creep into the statistical analyses of interventions—Simpson’s Paradox, Berkson’s Paradox, Milton Friedman’s Thermostat, and so on. But I hope to only spend part of the lecture on this. Because retrospective analyses are not the only sorts of studies subject to bias. Randomized controlled trials have their own set of issues, especially when they are not fully blinded.
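Since Simpson’s Paradox is easier to show than to tell, here’s a toy version, with counts adapted from the textbook kidney-stone example: treatment A wins inside each severity group but loses in the pooled comparison, because A is given disproportionately to the harder cases.

```python
# Toy Simpson's Paradox: treatment A beats B within each severity group,
# yet loses in the pooled comparison, because A is assigned mostly to the
# harder (large-stone) cases.
data = {
    # group: {arm: (recovered, total)}
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

pooled = {"A": [0, 0], "B": [0, 0]}
for group, arms in data.items():
    for arm, (rec, tot) in arms.items():
        pooled[arm][0] += rec
        pooled[arm][1] += tot
        print(f"{group:12s} {arm}: {rec / tot:.0%}")

for arm, (rec, tot) in pooled.items():
    print(f"{'pooled':12s} {arm}: {rec / tot:.0%}")
```

Which of these numbers you should act on depends on which intervention you’re actually contemplating, and that is exactly the kind of question the retrospective table alone can’t settle.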
Because of these biases, I want to spend a little time at least banging my drum about bureaucratic statistics. Though we won’t be doing much on this for the rest of the class, it’s worth noting that most applications of randomized trials occur in regulatory settings. RCTs serve as approval mechanisms for policy changes. RCTs also just measure associations, but these associations are still useful for policymaking.
Mathematically modeling decisions that change the future is the big missing piece in the Meehlian problem of statistical vs clinical judgment. Sure, statistical tabulation is better at prediction than people. But when you broadly implement actuarial policies, you change the distribution and move from snapshot to process.
Modern statistics is ill-equipped to deal with process, but we don’t have a clear alternative class to offer. Once you become obsessed with the problem of process, you either become a complex systems nut or a cybernetics nut, and no one listens to either.1 Kevin Munger calls process theory an “antimeme,” an idea so incongruous with our common discourse that it can’t spread. People love to argue about statistics, but as soon as we start reasoning about process, everyone gets confused and has a hard time talking to each other.
Though we’re not going to get too deep into it this semester, today’s discussion foreshadows a course I’m going to teach in the spring about machine learning, dynamics, and control. I wrote a blog series about this a year and a half ago, and I want to spend a semester fleshing out these ideas, connecting them to concepts in stochastic optimization, dynamic programming, and feedback theory. My goal is to decrypt some of this language about process, causation, and pragmatism. You’ll get to watch the process of attempting (and probably failing) to turn an antimeme into a meme.
1. I’m lumping the control theorists mostly in bucket 2 (the cybernetics nuts), even though some definitely hang out in bucket 1 (the complex systems nuts).


Is there a difference between a complex systems nut and a cybernetics nut? I personally identify as both. Process is truly fascinating.
I'm curious to hear your thoughts on Hume's "An Enquiry Concerning Human Understanding," particularly Section VII on "necessary connection." Hume ultimately relegates "causality" to a metaphysical concept and suggests all we can reason about is prediction. The field of "causal" inference, particularly in the social sciences where data is observational and collected from a complex system, seems to be very confused about what statistics can and cannot do. Terms like randomization, ignorability, and stable unit treatment values are invoked with almost liturgical regularity, as if their mere presence absolves one from deeper epistemological scrutiny. My field of economics often feels like a causality-themed masquerade ball. Conversations with applied microeconomists these days often resemble a Carrollian tea party: elaborate, incoherent, and indifferent to the actual question at hand. I'm not sure how best to dislodge some of the more ritualistic invocations of “identification” and provoke a more coherent discussion about what we’re actually doing when we claim to estimate causal effects. I'm inclined to offer Bruno de Finetti as the ultimate corrective, a reminder that coherence trumps counterfactual fantasies.
Lastly, I'm not sure how the average treatment effect became the center of focus in "causal" studies, since we are ultimately interested in who (which subgroups) specifically might benefit from a treatment. Summary results from clinical trials, like the "average treatment effect," may be non-representative of the treatment effect for a typical patient in the trial. Wouldn't a cluster-based analysis of benefits and risks be more appropriate?