Let’s shift from evaluating bits to evaluating machine learning. Last week, I described a way to think about how evaluation ties our hands when predicting bits. No matter what stories we told ourselves about data generation, evaluating predictions by average error was enough to constrain predictions to be averages of the past.
Let’s now slightly complicate the story. Instead of predicting one stream of bits, we want to predict many different streams. Every time we have to predict a bit, we note the particular context it comes from. Within a context, once we accept that all bits are equal, last week’s argument applies: we should predict the average of what we have already seen. It’s not too surprising that our evaluation metric ties our hands again in the case of multiple contexts. The best thing to do is predict the average in each context.
If we expand our prediction of free throws from Steph Curry to the rest of the NBA, we will average each player’s results on the season when they come to the line. If we wanted to predict whether a kicker would make a field goal, we might average all their kicks in that range. If the ball is on the 34 yard line, we will average the success rate of the kicker on attempts between the 30 and the 40. You can come up with whatever sports stats you’d like, and you’ll lean on similar procedures.
When we score multiple contexts by average error, the optimal prediction is the conditional probability of a success given the context. Conditioned on this context, how frequently is there a success? What is the probability that y occurs given that my data is x? The answer is given by the regression function
f(x) = E[y | x]
I’ll skip the math today, but if you look at the second chapter of Patterns, Predictions, and Actions, Moritz and I derive that this is the ideal predictor. We worked out the iid case, but you can derive the same result in the nonrandom case, albeit with more complicated calculations.
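To make this concrete, here’s a small numerical sketch (not the book’s derivation, and assuming squared error as the “average error” in question): among all predictors that are constant within each context, the per-context empirical average achieves the lowest average error. The context names and probabilities are made up for illustration.

```python
# A minimal sketch: per-context averaging minimizes average (squared) error
# among all predictions that are constant within each context.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical contexts and success probabilities, made up for illustration.
probs = {"curry_ft": 0.9, "coin_flip": 0.5, "long_fg": 0.3}
contexts = rng.choice(list(probs), size=10_000)
bits = np.array([float(rng.random() < probs[c]) for c in contexts])

def avg_error(predictor):
    """Average squared error of a per-context constant predictor."""
    preds = np.array([predictor[c] for c in contexts])
    return np.mean((preds - bits) ** 2)

# f(x) = empirical average of y within context x.
cond_mean = {c: bits[contexts == c].mean() for c in probs}

print("conditional average:", avg_error(cond_mean))
print("always predict 0.5 :", avg_error({c: 0.5 for c in probs}))
print("nudged averages    :", avg_error({c: cond_mean[c] + 0.05 for c in probs}))
```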
Now, computing this conditional frequency is impossible when there are too many contexts. At some point, as you make your categories finer and finer, you run out of similar data to make predictions. How many times has Steph Curry hit free throws in the Crypto.com Arena with eight minutes and nine seconds left in the third quarter? This reference class is too small to make any reasonable guesses.
A context is good if you have seen enough events with the same context to make a confident statistical prediction. Interestingly, the meaning of “enough” varies by context. If the probability of an event happening is p, you need on the order of 1/(p-½)^2 observations to reliably decide whether the probability is greater or less than one half.
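Here’s a back-of-the-envelope rendering of that claim, assuming a Hoeffding-style scaling and sweeping constants and confidence levels under the rug.

```python
# Rough sample counts needed to tell a bias p apart from 1/2.
# The 1/(p - 1/2)^2 scaling is the point; exact constants are ignored.
import math

def samples_needed(p):
    """Rough number of observations to distinguish p from 1/2."""
    gap = abs(p - 0.5)
    if gap == 0:
        return float("inf")  # p = 1/2 can never be told apart from itself
    return math.ceil(1.0 / gap**2)

for p in [0.55, 0.6, 0.75, 0.9, 0.99]:
    print(f"p = {p:0.2f}: about {samples_needed(p):>4} observations")
```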
Let me narrow the focus. There are plenty of prediction problems where p is very close to 0 or 1. If my context is an array of pixels on a screen, the probability these particular pixels form an image containing a cat is 0 or 1. Similarly, in text prediction, the probability that a specific token follows a text window is almost always 0, and, in the rare nonzero cases, it is equal to 1 more often than not.
Very predictable bits can be predicted with almost no data. With zero instances from a context, all you can do is guess ½, and your error is ½ on every bit. But seeing one instance and predicting the bit you saw immediately drops the expected prediction error to 2p(1-p). If p is close to 0 or 1, you make excellent predictions as soon as you see one case.
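A quick simulation of those two numbers, assuming the absolute-error scoring used above and a made-up value of p: guessing ½ with no data costs ½ per bit, while predicting the single bit you have seen costs about 2p(1-p).

```python
# Sketch: error before and after seeing one bit from a very predictable context.
import numpy as np

rng = np.random.default_rng(1)
p = 0.99  # a "very predictable" context, chosen for illustration

n_trials = 100_000
first_bit = rng.random(n_trials) < p  # the one observation from this context
next_bit = rng.random(n_trials) < p   # the bit we are asked to predict

err_no_data = np.mean(np.abs(0.5 - next_bit))                       # always 1/2
err_one_obs = np.mean(np.abs(first_bit.astype(float) - next_bit))   # ~ 2p(1-p)

print(f"error with no data:  {err_no_data:.4f}")
print(f"error after one bit: {err_one_obs:.4f}")
print(f"theoretical 2p(1-p): {2 * p * (1 - p):.4f}")
```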
If we have a lot of contexts with very predictable bits, we find ourselves in a very particular corner of the bit prediction universe. Our goal now is to see as many contexts as possible. For any new data, we would then return whatever we saw the last time we saw the same context. What’s funny here is that you might say this is overfitting. It’s not overfitting. You can’t overfit if you have all of the data that could ever arise.1
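In code, this strategy is just a lookup table. The sketch below is a toy illustration, not a serious system: remember the last bit seen in each context and replay it when the context comes back.

```python
# A memorization predictor: store the last bit seen per context, replay it.
class LookupPredictor:
    def __init__(self, default=0.5):
        self.memory = {}        # context -> last observed bit
        self.default = default  # guess when a context has never been seen

    def predict(self, context):
        return self.memory.get(context, self.default)

    def update(self, context, bit):
        self.memory[context] = bit

# Usage: predict, observe, memorize.
model = LookupPredictor()
print(model.predict("curry_free_throw"))  # 0.5, never seen before
model.update("curry_free_throw", 1)
print(model.predict("curry_free_throw"))  # 1, replay the last observation
```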
Of course, until OpenAI spends a trillion dollars on data centers, there are still far too many contexts to observe in the universe. All possible future outcomes do not yet exist in an S3 bucket somewhere. When there are unseen contexts, we need some way to interpolate between the collection of contexts seen so far.
There’s only one problem: there will be an infinite number of ways of approximating the regression function on the data we have. An infinite set of functions will achieve the same average error in the contexts we’ve seen. How do we pick from this list?
The answer we’ve settled on in machine learning is sensible. Split the data into two sets at random. Call one set the training set and the other set the test set. Among all the functions that achieve around the same average error on the training set, pick the one that minimizes the error on the test set. This is the holdout method. It is a particular form of a broader idea called cross-validation.
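Here is the recipe in miniature, with made-up data and an arbitrary family of candidate predictors (nearest-neighbor fits with different neighborhood sizes, standing in for whatever function class you like): split at random, fit the candidates on the training set, and keep whichever one has the smallest test error.

```python
# The holdout method, sketched: random split, fit candidates, pick by test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(2_000, 5))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(float)  # hypothetical bit stream

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_model, best_err = None, np.inf
for k in [1, 5, 25, 125]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    test_err = np.mean(np.abs(model.predict(X_test) - y_test))
    print(f"k = {k:>3}: test error {test_err:.3f}")
    if test_err < best_err:
        best_model, best_err = model, test_err
```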
There is much to like about the holdout method. The random split is legitimately random because the engineer controls the partition. This means we can use large deviation inequalities to bound the difference between the error on the test set and the error on the full dataset from which it was sampled. We already assumed that a small average error on the available data was a reasonable proxy for a small average error on the data we haven’t seen. And the training set lets us compute a vast set of interpolating functions to competitively test.
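For a sense of what those large deviation inequalities buy you: under a Hoeffding-style bound, treating per-example errors as bounded in [0, 1], a randomly sampled test set of size n pins down the error to within roughly sqrt(log(2/δ)/(2n)).

```python
# Sketch: Hoeffding-style deviation between test-set error and the error on
# the pool it was sampled from, for errors bounded in [0, 1].
import math

def hoeffding_deviation(n, delta=0.05):
    """Deviation t with P(|test error - pool error| > t) <= delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

for n in [100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>7}: test error within ±{hoeffding_deviation(n):.3f} "
          f"(95% confidence)")
```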
Though quite sensible, the holdout method has dramatically outperformed its modest expectations. Once you have a holdout set, you have a benchmark. And once you have a benchmark, you inspire engineering through competitive testing, Goodhart’s Law be damned.
I’ll come back to this soon, but I generally think that “overfitting” in the way it is colloquially used in data science and machine learning doesn’t exist at all.