I’m teaching machine learning from a single textbook for the first time in my career. Of course, that’s because Moritz Hardt and I tortured ourselves to write a book, Patterns, Predictions, and Actions, that we felt encompassed what a grad course in machine learning should be in the 2020s. But I’m proud of where we converged, and I hope you all enjoy our book as well. Even though I have the text, I’m going to modify the syllabus this semester to deviate slightly from the book. The machine learning story isn’t quite finished.
I’m planning on an even cleaner split between predictions and actions. The first half of the course will cover supervised machine learning, starting and ending with the perceptron. It’s bizarre that, though we’ve had 70 years of research detours in between, Claude Shannon’s language model from 1948 and Frank Rosenblatt’s Perceptron from 1958 tell us more or less everything there is to know about large language models and other state-of-the-art “AI.” But that’s how research works sometimes. The same thing happened in physics: though there were many wild supersymmetric string theories, fundamental physics is stuck with the same Standard Model it had in the 1960s.
I don’t even think this is a bad thing. We learned plenty in the interim. And one of the things we learned was that Shannon and Rosenblatt were right. We also learned that we needed 70 years of computer engineering to build machines that could realize their vision. Many researchers in the 1960s made similar claims that all they needed were faster computers and more memory. They turned out to be correct. Shannon was able to predict the next letters in Dumas Malone’s biography of Thomas Jefferson without a computer. But to get ChatGPT, we need billions of dollars in cutting-edge computing.
My focus in the prediction phase of the course will be to isolate when we need to invoke probability. There are two kinds of probability that fly around in statistical science: intentional probability and natural probability. Intentional probability comes from random number generators that let us probe, measure, and simulate. For example, randomly selecting a sample from a web-scale data set is intentional probability. Sampling grid points to build an interpolating polynomial is intentional probability. Randomly initializing the weights in a neural network is intentional probability. Generating text from a probabilistic language model is intentional probability.
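To make the distinction concrete, here is a minimal Python sketch (the sizes and the model are toy stand-ins, made up for illustration) where every bit of randomness is intentional: we seed the generator ourselves and use it to subsample data, initialize weights, and sample a token.

```python
import numpy as np

rng = np.random.default_rng(0)  # we pick the seed; the randomness is ours

# Intentional probability: randomly select a sample from a large data set.
X = rng.standard_normal((100_000, 20))            # stand-in for a web-scale corpus
subsample = X[rng.choice(len(X), size=1_000, replace=False)]

# Intentional probability: randomly initialize the weights of a neural network.
W1 = rng.standard_normal((20, 64)) * np.sqrt(2.0 / 20)
W2 = rng.standard_normal((64, 1)) * np.sqrt(2.0 / 64)

# Intentional probability: generate a token from a probabilistic language model.
logits = rng.standard_normal(50)                  # toy next-token scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_token = rng.choice(50, p=probs)              # generation is a draw we choose to make
```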
Natural probability is where we assume the world hands us reality using its own random number generator. In physics, we might say that certain signal patterns are practically indistinguishable from white noise and hence model those signals as stochastic. But in machine learning, we mindlessly invoke probability as the source of all uncertainty. Notably, we use natural probability as a crutch when we say that data is magically “sampled” from a probability distribution. We carelessly assume that data is drawn independently and identically from an unknown distribution and then prove theorems or issue guidance about practice.
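The most basic version of such a theorem, stated for a single fixed predictor $f$ with a loss $\ell$ bounded in $[0,1]$, is Hoeffding’s inequality:

$$
\Pr\!\left[\ \left|\frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i) \;-\; \mathbb{E}\,\ell(f(x), y)\right| \ \ge\ \sqrt{\frac{\log(2/\delta)}{2n}}\ \right] \ \le\ \delta,
$$

where the probability is over the draw of the $n$ i.i.d. pairs $(x_i, y_i)$. The i.i.d. assumption is what licenses the concentration, and the fancier uniform convergence bounds are built on top of concentration statements like this one.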
This semester, I aim to show that you can derive most of what we do in machine learning prediction without ever invoking natural probability. For most models of practical interest in prediction, you get more or less the same answers whether or not you assume a probabilistic model.
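The perceptron itself is the cleanest example. Here is a minimal sketch (the four-point data set below is made up for illustration); the algorithm is stated entirely in terms of the data in front of you, and Novikoff’s mistake bound, that the number of updates is at most $(R/\gamma)^2$ when the data has margin $\gamma$ and norm at most $R$, never mentions a distribution.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Rosenblatt's perceptron. No distributional assumptions anywhere:
    the mistake bound depends only on the margin and norm of the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake: update
                w = w + y_i * x_i
    return w

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # recovers the labels
```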
I know some theorists will yell at me about this: “Oh, but you need randomness to learn a bifurcated nonlinear threshold in hyperbolic space.” But the problem with machine learning theory is that it is mostly made up. We theorists play a weird game where we assume that a bunch of unverifiable assumptions hold and then use these assumptions to prove that prediction error rates go to zero. These theorems are valid given the axioms we assume. But if our axioms are all unverifiable, and if the guidance of the theory doesn’t reflect practice, then what is this theory for? I’ll have a lot more to say about this in the coming weeks and look forward to being yelled at, er, set straight by my theory friends.
Now, when the course shifts to actions in mid-October, I get stuck. The second half of the course is about decision-making under uncertainty, and everything in our book models uncertainty with stochastic processes. The Actions component is then a focused tour of stochastic optimization: minimizing the expected value of a cost function under some probabilistic model of uncertainty. Minimizing Bayes Risk in decision theory. Random hypothesis testing. Bandits. Dynamic Programming for Markov Decision Processes. All of these models rest on the decision maker’s environment being random. I don’t know how to do this yet, but my semester goal is to figure out how to cast these problems without probability. I don’t know whether the answer will come from Robust Optimization or Metaphysics. But send me suggestions if you have them. And I’ll keep you posted as the semester progresses.
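As a cartoon of the gap, here is a hypothetical one-dimensional decision problem (the scenario values and weights are made up): the stochastic formulation averages a cost over a probability model of the uncertainty, while a robust formulation minimizes the worst case over a scenario set and never mentions a distribution.

```python
import numpy as np

# Hypothetical scenarios for an uncertain parameter w, and a simple quadratic cost.
scenarios = np.array([-1.0, 0.0, 0.5, 2.0])
weights = np.array([0.1, 0.4, 0.4, 0.1])           # a probability model over scenarios

def cost(a, w):
    return (a - w) ** 2

actions = np.linspace(-2, 3, 501)

# Stochastic optimization: minimize the expected cost under the probability model.
expected = np.array([np.dot(weights, cost(a, scenarios)) for a in actions])
a_stochastic = actions[np.argmin(expected)]

# Robust optimization: minimize the worst-case cost over the scenario set.
worst_case = np.array([cost(a, scenarios).max() for a in actions])
a_robust = actions[np.argmin(worst_case)]

print(a_stochastic, a_robust)
```

In this toy example, the stochastic answer is the weighted mean of the scenarios and the robust answer is the midpoint of their range; same cost function, very different commitments about the world.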
Sounds like it will be an interesting semester! On the action side, some of the things you mention apply in totally deterministic environments (e.g. LQR looks the same even in the absence of process noise). And it's often possible to replace "assume Gaussian noise" with "minimize an appropriate least squares objective" (e.g., for state estimation: https://slides.com/sarahdean-2/08-state-estimation-ml-in-feedback-sys?token=565bwizg#/12/0/3) -- of course, without a Gaussian model, there's no "deeper" motivation for why least squares is the "correct" objective to have.
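To spell out that replacement in one line (my notation, not the slides'): with a linear measurement model

$$
y = Cx + v, \qquad v \sim \mathcal{N}(0, \sigma^{2} I),
$$

maximizing the likelihood $p(y \mid x)$ over $x$ is the same as minimizing $\|y - Cx\|_{2}^{2}$, so the Gaussian assumption and the least squares objective select the same estimate, and the quadratic objective still makes sense after the Gaussian story is dropped.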
On the static prediction side, I find the framing of Michael Kim's "Outcome Indistinguishability" helpful for thinking about where uncertainty comes from. I also like the philosophy paper that inspired the work: https://link.springer.com/article/10.1007/s11229-015-0953-4. It provides a nice taxonomy of interpretations of probability. (I made some summary slides of it here: https://slides.com/sarahdean-2/aipp-probability-time-individual-risk?token=vx-PDQk9)
Are you planning to teach anything about LLMs, Ben?