Fact, Fiction, and Forecast
A syllabus for a half-semester course on prediction from data.
“Regularities are where you find them, and you can find them anywhere.” – Nelson Goodman
This is a live blog of Lecture 1 (part 2 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
When Moritz and I decided to reboot graduate machine learning at Berkeley, we weren’t sure where to start. Our main grad class was on graphical models, and in 2019, this was out of fashion. But what would be appropriate? We came up with this idea to teach random hypothesis testing, then show how supervised learning approximates hypothesis testing, and then move on to study empirical risk minimization. I was discussing the idea with some colleagues at a workshop in Seattle, and Kevin Jamieson jerked his head back and scrunched up his face (in the way that Kevin does). “Uh, that sounds an awful lot like Duda and Hart.”
I hadn’t remembered it this way. I had only skimmed the 2001 version of Pattern Classification, coauthored with David Stork. But I checked out the first edition from the Berkeley library and was shocked. Kevin was right. The Table of Contents:
Bayes Decision Theory
Linear Discriminant Functions
This was how Duda and Hart taught machine learning in 1969 at Stanford. This would be totally acceptable as a course on machine learning in 2023. Duda and Hart didn’t call it machine learning because that term meant something entirely different at the time. Arthur Samuel coined machine learning to describe how his checkers program improved itself through repeated play. What Samuel called machine learning, we now call reinforcement learning. Duda and Hart were after something else and called their course pattern classification.
Pattern classification is a much more honest description of what we do today in machine learning. But it’s harder to sell a startup if you’re classifying patterns than if you’re solving artificial general intelligence.
Moritz and I started to track down other connections. We read Minsky’s PhD thesis. We found Rosenblatt’s writing on the Perceptron. We read Bryson’s paper on the method of adjoints. We discovered the work of Bill Highleyman, establishing the train-test paradigm of machine learning evaluation.
Even the theoretical results of the time were as convincing as anything we have today. Sasha Rakhlin sent us a beautifully short proof of why the Perceptron generalizes beyond the data on which it was trained. This proof appeared in Vapnik and Chervonenkis’ Theory of Pattern Recognition in 1974, but Sasha thinks the argument is due to Aizerman from the mid-1960s.
Like all proofs on generalization, this argument rests upon wholly unverifiable assumptions about how new data will relate to the old. In this case, it assumes the data stream consists of samples drawn independently from the same generative distribution. In the decades since, we haven’t come up with a more compelling argument for why pattern classification works.
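For reference, the mistake bound at the heart of that argument is a standard result: if every example has norm at most R and some unit vector separates the data stream with margin γ, then the Perceptron makes at most

```latex
M \le \left(\frac{R}{\gamma}\right)^2
```

mistakes, no matter the order in which the examples arrive. The i.i.d. assumption enters only when converting this online guarantee into a claim about accuracy on new samples.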
And so our course today tries to understand the past to help guide us into the future. The first half is a survey of the pre-1970s work on pattern classification presented with our modern perspective. In some ways, the mathematical arguments are much simpler today. I’ll start with a survey of prediction, how we evaluate prediction, and what prediction might be useful for. This discussion will lead us to conclude that we should build pattern classifiers by finding computer programs that maximize their accuracy on the data we have collected. I’ll then present four complementary ways of thinking about why such prediction algorithms work, focusing on linear prediction functions. I’ll start with the classic version of Perceptron and the mistake bound (1955-1961). I’ll then describe the modern version of this argument based on sequence prediction. I’ll discuss the interpolation version of prediction and what that suggests about data collection. And I’ll give an optimization view of prediction that directly solves the accuracy maximization problem.
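For readers who haven’t seen it, the classic Perceptron fits in a dozen lines. This is a minimal NumPy sketch for illustration, not the course’s code; the toy data and variable names are mine:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Classic Perceptron: update the weights only when a mistake is made.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    Returns the learned weight vector and the total mistake count.
    """
    n, d = X.shape
    w = np.zeros(d)
    mistakes = 0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:   # wrong sign (or undecided): a mistake
                w += y[i] * X[i]         # nudge w toward the misclassified example
                mistakes += 1
    return w, mistakes

# Linearly separable toy data: the label is the sign of the first coordinate.
X = np.array([[1.0, 0.2], [2.0, -0.5], [-1.5, 0.3], [-0.7, -0.8]])
y = np.array([1, 1, -1, -1])
w, m = perceptron(X, y)
print(all(np.sign(X @ w) == y))  # True: the learned w separates the data
```

The mistake bound says the inner update fires only a bounded number of times on separable data, which is why the loop terminates with a separating hyperplane.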
What about nonlinear functions? No model is actually linear because the “features” we decide to use are already nonlinear measurements of some phenomenon. Where do these features come from? And how do we build more complex features? The next part of the course will cover measurement, digitization, and universal approximation.
Once you have a nonlinear model, what do you do? In some cases, the linear methods directly apply. But this is also where we fast forward to the 1980s and describe backpropagation. We now know that backpropagation is the stochastic gradient method (Robbins and Monro, 1951) combined with Bryson’s method of adjoints for automatic differentiation (1962). ML researchers put these together in the 1980s. I’ll discuss why it seems like nonlinear optimization is not much harder than linear optimization for prediction. I don’t have a good answer, but the evidence suggests that prediction is a very special optimization problem. I’ll try to give insights as to why it is particularly easy to optimize. I’ll also address the question of “network size.” We live in an age where we think bigger models are better. But this idea has plenty of antecedents, and I’ll talk about what people were doing in the 1960s as a reference.
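To make the “backpropagation = stochastic gradient + adjoints” decomposition concrete, here is the Robbins–Monro stochastic gradient step on the simplest model where it applies, least-squares regression. This is an illustrative sketch under made-up data and a hand-picked step size, not the lecture’s code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression data: y = X @ w_true + small noise.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true + 0.01 * rng.normal(size=200)

# Stochastic gradient method: one randomly chosen example per step.
w = np.zeros(2)
step = 0.1
for t in range(2000):
    i = rng.integers(len(X))
    grad = (X[i] @ w - y[i]) * X[i]  # gradient of the single-example loss (x_i.w - y_i)^2 / 2
    w -= step * grad

print(np.allclose(w, w_true, atol=0.1))  # True: w converges near w_true
```

For a deep network, the only change is that the per-example gradient is computed by the adjoint (reverse-mode) recursion through the layers; the outer loop is the same 1951 algorithm.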
The last part of the first half of the course is about datasets. Datasets play a pivotal role in machine learning. When did we decide that all evaluation would effectively be through leaderboards and Kaggle competitions? I’ll describe the history of datasets in machine learning, starting at Bell Labs, moving to UC Irvine, and ending in the present. I’ll then describe the cultural patterns around how we use datasets. This will let me highlight how the conventional “statistical” wisdom on how machine learning behaves out-of-sample is flat-out wrong.
The second half of the course is going to be decidedly different. It will be about taking action. Surveying the course plan for this will need a whole blog post of its own. That post will come on Monday.