This is a live blog of Lecture 1 (part 3 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
Predictions are all well and good, but at some point you need to do something with them. The second half of the course is about acting based on data and patterns.
Last Thursday, a student asked a great question (I’ll learn your names if we keep the class under 100), “Isn’t decision-making just pattern recognition?” Plenty of evidence suggests that human experts make decisions based more on pattern recognition than on some rational choice optimization procedure. But what about machines? To build robust, safe, and efficient autonomous systems, would it suffice to use the machine pattern recognition methods from the first half of the course?
Our planet is blanketed with engineered systems that process information and make decisions. Would we say that a thermostat is a pattern recognition system? If it measures that the temperature is too hot, it turns the cooling on. If the temperature is too cold, it turns the cooling off. It’s making a decision based on data. But it enforces a simple hard-coded rule. It does not match its current experience against a record of past encounters to predict the best action.
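To make the contrast concrete, here is a minimal sketch of the kind of hard-coded rule a thermostat enforces. The setpoint and deadband values are invented for illustration, not taken from any real device; the point is that the rule consults no record of past encounters.

```python
def thermostat(temperature, cooling_on, setpoint=72.0, deadband=1.0):
    """Hard-coded bang-bang rule: no learning, no memory of past encounters."""
    if temperature > setpoint + deadband:
        return True       # too hot: turn the cooling on
    if temperature < setpoint - deadband:
        return False      # too cold: turn the cooling off
    return cooling_on     # within the deadband: leave things as they are

print(thermostat(75.2, cooling_on=False))  # True
print(thermostat(69.5, cooling_on=True))   # False
```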
But would we be better off with machine-learning thermostats? Would we be better off with self-driving cars that learned from their chaotic encounters with emergency vehicles? The term “machine learning” was coined by Arthur Samuel in 1959 to describe systems that improve their performance via repeated experience. Samuel, in particular, built a machine learning system for checkers—called Alpha—that honed its skills through repeated self-play. Samuel’s approach to learning was a primitive form of what we now know as Temporal Difference learning. We also know that Temporal Difference learning is an approximation to Dynamic Programming. Dynamic Programming aims to find a policy that maximizes some desirable outcome in a dynamic system with uncertainty.
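For concreteness, here is the textbook relationship in symbols; the notation (value function V, reward r, discount factor γ, step size α) is mine, not the lecture’s. Dynamic programming evaluates a policy π by solving the Bellman equation, which requires the full transition model:

$$V^{\pi}(s) \;=\; \sum_{s'} P(s' \mid s, \pi(s))\,\big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big]$$

Temporal difference learning instead nudges the current estimate toward a single sampled transition (s, r, s′):

$$V(s) \;\leftarrow\; V(s) + \alpha\,\big[\, r + \gamma\, V(s') - V(s) \,\big]$$

Replacing the expectation over next states with a lone sample is the sense in which temporal difference learning approximates dynamic programming.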
That description of dynamic programming is a nice way to phrase the central question of the course:
How do we algorithmically find policies that maximize desirable outcomes in uncertain dynamical systems?
Optimization researchers call this problem “Stochastic Optimization.” But, as Max Raginsky elegantly lays out on his Substack, this problem is also about causal inference. We have systems where we can intervene. They have inputs that let us attempt to steer them. We have some outputs we’d like to target. This naturally leads to the question of counterfactuals: what would happen if we chose one set of inputs rather than another? It also leads to questions about “optimal control,” where we find the inputs that maximize some output.
So we’ll start with such counterfactual analyses, asking how to choose the better of two actions. This binary decision-making will feel a lot like pattern classification, labeling a data point with a “0” or a “1” based on context. Decisions with binary actions will take us through Optimal Decision Theory, the Neyman-Pearson Lemma, and ROC curves. We can think about how decisions have different outcomes in different subpopulations, letting us discuss issues of fairness in decision-making. All of these discussions will be model-based: we’ll try to determine the optimal action given a complete model of the uncertainty. When we don’t know such a model, I’ll discuss how to make decisions using experiment design and randomized trials. From here, we’ll talk about observational causal inference, which tries to run randomized controlled trials without ever actually intervening in the world. It’s armchair policymaking.
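To give a taste of what those binary decision rules look like, here is a minimal sketch with invented Gaussian likelihoods for the two hypotheses (nothing here comes from the lecture). The Neyman-Pearson flavor of test thresholds the likelihood ratio, and sweeping that threshold traces out the false positive/true positive trade-off that an ROC curve summarizes.

```python
import numpy as np

# Invented example: H0 says x ~ N(0, 1), H1 says x ~ N(1, 1).
def log_likelihood_ratio(x):
    # log p1(x) - log p0(x) for the two unit-variance Gaussians above
    return -0.5 * (x - 1.0) ** 2 + 0.5 * x ** 2

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 100_000)  # data generated under H0
x1 = rng.normal(1.0, 1.0, 100_000)  # data generated under H1

# Each threshold gives one operating point (false positive rate, true positive rate).
for tau in [-1.0, 0.0, 1.0]:
    fpr = np.mean(log_likelihood_ratio(x0) > tau)
    tpr = np.mean(log_likelihood_ratio(x1) > tau)
    print(f"threshold {tau:+.1f}: FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```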
We’ll then transition to trying to find policies that make multiple decisions over time to maximize their objective. We will discuss dynamic programming and optimal control. Dynamic programming again requires a model of how the world works. What do we do if we don’t know this model? To attempt to answer this question, we’ll discuss approximate dynamic programming, bandits, and what we call reinforcement learning. The class will end where machine learning started: with a discussion of computational gameplay, why it’s interesting intellectually, and why it’s useless for decision-making in reality.
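To see concretely what “requires a model of how the world works” means, here is a minimal value-iteration sketch on a made-up two-state, two-action problem; the transition probabilities and rewards are invented for illustration. Every backup touches the transition matrix, so without a model the computation cannot even begin.

```python
import numpy as np

# Invented toy model: P[a, s, s'] is the probability of landing in s' after
# taking action a in state s; R[a, s] is the expected reward for that choice.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.3, 0.7]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup.
V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * (P @ V)   # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
    V = Q.max(axis=0)         # act greedily with respect to the model

print("optimal values:", np.round(V, 3), "greedy policy:", Q.argmax(axis=0))
```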
As you can see, there will be a heavy reliance on modeling! Those models are built by the engineers, not by the machines. There is a natural abstraction in decision-making where we first summarize our experiences into an optimization model of goals, constraints, and uncertainty. We then use this model to find the best policy. Alternatively, we could go straight from data to policy, with a big mess of machine learning in the middle. We’re going to try to understand why this “end-to-end” alternative doesn’t work for any mission-critical problem.
But this brings us back to the beginning. Didn’t I argue that there’s substantial evidence that human decision-makers are pattern recognizers? There isn’t a contradiction because people recognize patterns differently than how they build machines to recognize patterns. I have been striving to better articulate why these are different. Maybe this semester, I’ll finally be able to describe the difference more clearly.
Thanks for the shoutout, Ben!
As for whether decision-making is ultimately about pattern recognition, my current take, informed by imbibing heavy doses of evolutionary epistemology literature written by biologists and ethologists (Konrad Lorenz, Rupert Riedl), is that yes, at some sufficiently low level, that’s all there is, and more complex systems, including living organisms, just wrap layers of abstraction and feedback around simple pattern recognizers.
The distinction between "pattern recognition" and "optimizing models" is not obvious to me. Of course, at some intuitive level, binary classification feels different from LQR. But I think this glosses over the modelling that goes into setting up a classification task. Isn't picking the features, labels, and loss function also modelling? Maybe that's what you're getting at in the last paragraph here.
A related paradox (to me) is that learning a fixed predictive model is somehow seen as more general than doing state estimation, because state estimation requires specifying a model of dynamics. But the fixed predictive model also has a dynamics model -- one that says "nothing changes"!