This is a live blog of Lecture 6 (part 1 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
How does a machine learning model perform on data it hasn’t seen? To answer this question, we must have some theory of how this new data relates to the data we used to build the model in the first place. In the next few blogs, I’ll discuss the prevailing theory used in machine learning. I’ll describe how this theory shapes our conceptions of which algorithms should and shouldn’t work. And, most importantly, I’ll discuss how this theory leads to mostly bad advice for practice. I still think we need to wholly rethink the question of generalization.
In machine learning, the core model turns the question of generalization into a question of probability. We relate new data to old data by assuming both are generated probabilistically by some process. The most common assumption is that every data point, old and new, is drawn independently from the same probability distribution. That is, we assume each observation is an independent and identically distributed (i.i.d.) sample.
Let me begin, as I am known to do, with a rant. There is almost never a reasonable basis to assume that data is i.i.d.
What would an i.i.d. sample mean? I go to the data store. I ask for a new data point. The data shopkeeper goes to his back room and rolls several twenty-sided dice. Based on this roll, the shopkeeper then looks up the appropriate drawer and hands me whatever is inside. We repeat this Dungeons and Dragons interaction until I have enough data points to run my Kaggle competition.
Is data ever really generated like this?
Perhaps there are more reasonable models for i.i.d. processes. One common way to think about i.i.d. data is to assume there exists a superpopulation of all possible outcomes we could ever observe. The data we analyze is a with-replacement subsample of this superpopulation. We select a data point, record its properties, and put it back into the population like a fish too small to eat.
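As a throwaway sketch of this catch-and-release picture (the superpopulation, its distribution, and the sample size below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up superpopulation of every outcome we could ever observe.
superpopulation = rng.normal(loc=5.0, scale=2.0, size=100_000)

# An i.i.d. sample: draw with replacement, recording each fish and
# tossing it back into the pond after we measure it.
sample = rng.choice(superpopulation, size=500, replace=True)

print(sample.mean(), superpopulation.mean())  # the two means should be close
```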
If we were trying to build models of defects in a manufacturing process, the superpopulation would be all of the products in our factory, and we could fit a model to a subsample. In a medical context, we could assume that the people enrolled in a clinical trial are a sample from the superpopulation of all possible people at all stages of their lives. If we were forecasting elections, we could consider the superpopulation of all possible 2016 elections for president of the United States.
Let me be clear: in none of these examples is the superpopulation model reasonable. In the first case, grabbing samples from a manufacturing floor is done without replacement. This is a minor transgression, but a transgression nonetheless. In the second case, we never get a random sample in a clinical trial. Patients end up in a trial for specific enough reasons that generalizing the results to the population at large requires a good deal of care. And election forecasting is a bizarre exercise in clickbait that appeals to the degenerate gamblers in all of us. But the application of statistics there is nonsensical. Sorry, Bayesians.
I’m belaboring this point to note that the superpopulation assumption is also seldom true. Superpopulation models are sometimes reasonable, as when we are intentionally sampling at random from a well-understood collection of instances. More often than not, they are eminently unreasonable, as when we are predicting one-off events. I.i.d. caveat emptor.
Why do we assume this i.i.d. model if it’s clearly invalid? Because it is eminently amenable to bread-and-butter statistical analysis. One of the first things we learn in a probability class is that if you collect many independent samples from a probability distribution, the average of those samples is a good approximation of the distribution’s expected value. This is called the law of large numbers.
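As a quick toy illustration of my own (not from the lecture), here is the sample mean of i.i.d. draws closing in on the true expected value as the number of samples grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential distribution whose expected value is exactly 2.0.
samples = rng.exponential(scale=2.0, size=100_000)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>7,d}   sample mean = {samples[:n].mean():.4f}")
# The printed means drift toward 2.0: the law of large numbers at work.
```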
Let’s apply the law of large numbers to machine learning. If we want to know the error of a model on new data, we can hope it’s close to the error on the data we’ve seen. The expected error on new data is called the population error. The average error on the old data is called the empirical error. We say that the generalization error is the difference between the population and empirical errors. Putting this together yields The Fundamental Theorem of Machine Learning:
population error = empirical error + generalization error
I’ll leave the proof as an exercise to the reader. From this, a thousand papers flow.
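To make the decomposition concrete, here’s a minimal sketch under assumptions of my own choosing (a noisy linear model, squared-error loss, and a giant fresh sample standing in for the population):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw(n):
    """Draw n i.i.d. (x, y) pairs from a noisy linear model."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 3.0 * x[:, 0] + 0.5 * rng.normal(size=n)
    return x, y

# Fit least squares on a small training sample.
x_train, y_train = draw(50)
X = np.hstack([x_train, np.ones((50, 1))])  # add an intercept column
w, *_ = np.linalg.lstsq(X, y_train, rcond=None)

def squared_error(x, y):
    X = np.hstack([x, np.ones((len(x), 1))])
    return np.mean((X @ w - y) ** 2)

empirical_error = squared_error(x_train, y_train)           # error on the old data
population_error = squared_error(*draw(1_000_000))          # error on fresh data
generalization_error = population_error - empirical_error   # the gap between them

print(empirical_error, population_error, generalization_error)
```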
Under the i.i.d. assumption, if we have some function in a box that was chosen without ever looking at the data, the law of large numbers tells us that the generalization error shrinks as the number of samples increases. However, we are interested in the function we fit on the sampled data, and to that function we can’t apply the law of large numbers directly.
Think about the following silly example. Let’s say I want to predict someone’s eye color from their Student ID number. I take a sample of the students at Cal (being a state school, we have a superpopulation). I build a look-up table that returns the correct eye color for everyone in the sample and returns magenta for everyone else. The population error here is essentially 1, because no one has magenta eyes. But the empirical error is zero, so the generalization error is essentially 1.
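Here’s the same pathology as a sketch, with fabricated student IDs and eye colors (the superpopulation here is finite, so the population error comes out a bit under 1 rather than exactly 1):

```python
import numpy as np

rng = np.random.default_rng(3)

# A fake superpopulation: student IDs with randomly assigned eye colors.
ids = np.arange(50_000)
colors = rng.choice(["brown", "blue", "green", "hazel"], size=ids.size)

# Training sample: the students we actually looked at.
train_idx = rng.choice(ids.size, size=2_000, replace=False)
lookup = {ids[i]: colors[i] for i in train_idx}

def predict(student_id):
    # Return the memorized color if we've seen this student, magenta otherwise.
    return lookup.get(student_id, "magenta")

predictions = np.array([predict(i) for i in ids])
empirical_error = np.mean(predictions[train_idx] != colors[train_idx])  # 0.0
population_error = np.mean(predictions != colors)                       # ~0.96

print(empirical_error, population_error)
```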
Generalization Theory develops rules for fitting functions so that the law of large numbers still applies. In the next part of this series, I’ll describe several different ways this can occur. You generalize if you have a small enough set of models from which you choose your predictor. You generalize if you have a sequential prediction bound. You generalize if your machine learning algorithm is robust to leaving out one of your data points. But do any of these paths to generalization describe what we’re seeing in machine learning practice? Stay tuned to find out…
> the application of statistics there is nonsensical
Not trying to incite a holy war here, but it does seem like forecasting is a real skill. Leaving aside elections, what's your interpretation of what's going on when a forecaster on a site like Metaculus consistently wins? More generally, what's going on when a forecaster is rated highly using some metric which rewards calibration in addition to correctness?