Patterns, Predictions, and Actions (revisited)
The syllabus for this semester's graduate course on machine learning.
So I promised you a syllabus for graduate machine learning! Well, here's one for this semester, though I reserve the right to change it as the course plays out.
As always, this course takes a narrow view of machine learning, casting it as building predictions from examples. The main turn this year, which has been super helpful, is grounding everything in how we evaluate these predictions. This theme of evaluation was already implicit in Patterns, Predictions, and Actions, but making it explicit ties the semester together coherently.
Declaring how we will evaluate our predictions determines how we make the predictions. This Metrical Determinism will be a recurring theme in the class. If I tell you the metric, you can determine the optimal decision rule. More often than not, predictive engineering just targets this optimum. To make this clear, we'll cover the basics of decision theory, where this metrical determinism is explicit. Decision theory tells us what the optimal predictions would be if we could see all of the data in advance. If you force yourself to make predictions from data in a fixed format, what would be the best predictions you could make? We'll build this theory based on samples, not probabilities. Not much changes in this sample-based, actuarial view, but it resolves the paradoxes posed in Meehl's Clinical versus Statistical Prediction. We'll discuss all the different error metrics you might set for predictions and decisions, and learn about impossibility results like the Neyman-Pearson Lemma and orthogonality theorems from the fairness literature.
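To make the metrical determinism concrete, here's a tiny sketch of the sample-based rule (my own toy illustration; the costs and data are made up): if the metric charges c_fp for a false alarm and c_fn for a miss, the best decision at each observed x is just a threshold on the empirical frequency of positives.

```python
import numpy as np

def optimal_decisions(x, y, c_fp=1.0, c_fn=1.0):
    """For each observed value of x, pick the label that minimizes
    empirical cost on the samples we have (the actuarial rule)."""
    threshold = c_fp / (c_fp + c_fn)   # predict 1 iff the fraction of positives exceeds this
    rules = {}
    for value in np.unique(x):
        p_hat = y[x == value].mean()   # empirical frequency of y=1 at this x
        rules[value] = int(p_hat > threshold)
    return rules

# toy data and costs: here a miss is three times as expensive as a false alarm
x = np.array([0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 1, 0, 0, 0, 1, 0])
print(optimal_decisions(x, y, c_fp=1.0, c_fn=3.0))
```

Change the costs and the rule changes with them; that's the whole determinism.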
We'll then transition from decision theory to supervised learning, which we can think of as an approximate form of decision theory. Where decision theory tells us the optimal predictions when we can see all of the data in advance, supervised learning tries to find them when we only see a subset of the data. We'll focus on what matters most in practice: a non-exhaustive survey of how we represent and transform data, and the rudiments of stochastic gradient descent.
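And to show how rudimentary those rudiments are, here's stochastic gradient descent in its barest form (a sketch on least squares; the step size and epoch count are arbitrary choices, not recommendations):

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, epochs=10, seed=0):
    """Plain stochastic gradient descent on the squared loss, one sample at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # shuffle the samples each pass
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i . w - y_i)^2
            w -= lr * grad
    return w

# made-up data just to show the call
X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd_least_squares(X, y))
```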
Next, we’ll discuss how we evaluate the predictions of supervised learning. This is, of course, through competitive testing on benchmark data sets. We’ll go over the train-test split and the motivations for it (both cultural and statistical). We’ll discuss the many successful benchmarks in machine learning and consider what made these useful for the predictive engineers of their respective eras. We’ll also discuss what happens when the data we test on is different from the data we trained on.
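The mechanics of the protocol are almost embarrassingly simple; the interesting questions are why it works and when it breaks. A minimal sketch (the majority-vote "model" here is just a placeholder):

```python
import numpy as np

def holdout_error(fit, X, y, test_frac=0.2, seed=0):
    """Estimate prediction error with a single train-test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    test, train = idx[:n_test], idx[n_test:]
    predict = fit(X[train], y[train])             # fit only on the training split
    return np.mean(predict(X[test]) != y[test])   # report error only on held-out data

# placeholder model: always predict the majority label of the training split
def majority_fit(X_train, y_train):
    label = int(y_train.mean() > 0.5)
    return lambda X_new: np.full(len(X_new), label)

X = np.random.default_rng(2).normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(holdout_error(majority_fit, X, y))
```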
We'll then turn to other proposed evaluations, getting into my favorite topic: arguing with statisticians.1 We'll talk about putting error bars on evaluations and predictions, and what these mean (Spoiler Alert: the answer is not much). But by understanding the principles of error bars, we can discuss AB Tests and RCTs. Such randomized evaluations are prevalent tools for assessing not just machine learning, but all sorts of policies and decision systems. It's a short hop from the AB Test to the multi-armed bandit, and this will let us touch on the basics of bandits and sequential prediction.
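To give a flavor of that hop, here's a toy simulation (the 5% and 6% conversion rates are invented): an epsilon-greedy bandit is just an AB test that starts steering traffic toward the current leader as the data comes in.

```python
import numpy as np

def epsilon_greedy(arm_means, horizon=10_000, eps=0.1, seed=0):
    """Toy epsilon-greedy bandit: an AB test that adapts as it goes.
    arm_means are the true success rates, hidden from the algorithm."""
    rng = np.random.default_rng(seed)
    pulls = np.zeros(len(arm_means))
    wins = np.zeros(len(arm_means))
    for _ in range(horizon):
        if rng.random() < eps or pulls.min() == 0:
            arm = rng.integers(len(arm_means))      # explore uniformly at random
        else:
            arm = np.argmax(wins / pulls)           # exploit the current leader
        pulls[arm] += 1
        wins[arm] += rng.random() < arm_means[arm]  # simulated Bernoulli reward
    return wins.sum() / horizon, pulls

# hypothetical arms: a 5% versus 6% conversion-rate AB test
print(epsilon_greedy([0.05, 0.06]))
```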
Finally, the class ends with reinforcement learning. Now, in the past, this was an easy handoff, moving from sequential prediction to "sequential prediction with dynamics." When I wrote this survey in 2018, that's more or less all that reinforcement learning was. But contemporary reinforcement learning, the kind people are all excited about in LLM-land, is not this at all. It is a reformist reinforcement learning that uses all of the old words from Sutton and Barto, but has almost nothing to do with this older work. So rather than presenting conservative RL, which doesn't work, we're just going to take a total tangent and dive into this reformist RL, explaining what it is so we can get a sense for what people are writing all these papers about. You can do this without ever learning dynamic programming. You can do this without knowing what an MDP is. You can do this without ever saying "advantage function" or "cost-to-go." Reformist RL is essentially "guess and check." It is a way to inefficiently flail around looking for supervised learning signals in a world where they are a scarce resource. Reformist RL is not as cleanly analyzable as the perceptron (yet). But maybe there's a simple analysis and algorithm for how to do it. Let's see if we can find one before the ICML deadline.
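If you want the one-paragraph caricature in code, it looks something like this (entirely my own toy stand-in; the proposer and the reward function below are invented placeholders, not anyone's actual training recipe): guess a batch of answers, check them against a reward, and keep the winners as fresh supervised data.

```python
import random

def guess_and_check(prompts, propose, reward, n_samples=8):
    """Caricature of "guess and check": sample candidate answers for each prompt,
    score them, and keep the best ones as new supervised training pairs."""
    new_pairs = []
    for prompt in prompts:
        candidates = [propose(prompt) for _ in range(n_samples)]   # guess
        scored = [(reward(prompt, c), c) for c in candidates]      # check
        best_score, best = max(scored)
        if best_score > 0:                                         # keep only useful signal
            new_pairs.append((prompt, best))
    return new_pairs   # feed these back into ordinary supervised learning

# invented stand-ins: "answers" are random digits, the reward checks a hidden target
targets = {"q1": 3, "q2": 7}
propose = lambda prompt: random.randint(0, 9)
reward = lambda prompt, answer: int(answer == targets[prompt])
print(guess_and_check(["q1", "q2"], propose, reward))
```

The flailing described above is all in the guessing; the update that follows is plain supervised learning.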
This syllabus is not exactly what I want yet, but it’s getting there. I would like to fit in a few more lectures on the “modeling zoo” to help decide between the various optimizers and models available to the intrepid vibe coder. Some organizing principles for navigating the vast ML toolbox would be nice, no? I might try to squeeze that in depending on how the semester goes. In the sequential prediction unit, I’d like to mention some of the basic concepts of Defensive Forecasting from my survey with Juanky Perdomo. We’ll see how it goes and what time allows.
Most of the material I've discussed here is already in PPA, but some is new. In particular, I finally feel like I can teach this course without ever imagining a hypothetical "data-generating distribution" that feeds us independent, identically distributed samples (famous last words). I'll be supplementing parts with new notes that I'll link from this blog. They will be filled with errors, so help me find them. I'll post what I actually cover, including all of the readings, here on the blog. I haven't quite figured out the readings for the latter third of the course yet. I'll draw what I can from PPA, but will also supplement with my own notes and a few external sources.
One additional thing I'll post here this semester is problem sets. We're going to try a new experiment in this class: frontloading the work on the final project. Each problem set will include some questions about the project. The only way to really learn machine learning, and the metrical determinism of predictive optimization, is by applying it to predictions that matter to you.
1. I wanted to teach an entire course called "How To Argue With Statisticians," but couldn't get the logistics worked out for this year. Hopefully, I'll be allowed to do this in 2026. For now, we'll have to settle for a short version in the middle of the machine learning class. I swear it's relevant!