Learning from clairvoyance
The best case for future evaluation fixes our methods in the present.
This is a live blog of Lecture 1 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
Machine learning is the study of prediction from examples. Usually, we think of this as using examples we’ve already seen to make predictions about the future. To motivate methods that make inferences from past data, let’s start with what the best predictions could be if we knew the future.
In machine learning, we imagine there is a population of all possible events that we could ever forecast. This population is almost always metaphysical, meaning it doesn’t exist yet. For example, we build face detectors to work on images that have not yet been captured. We build diagnostic tests for patients who are not yet sick. These individuals do not yet exist, so we must hypothesize their existence. That means we must at least build a mental model of what the best predictions could be for this population.
Machine learning starts by quantifying predictive performance on this population. This is a numerical model of the cost of being wrong. Almost always, we model this as the average over all possible instances we’ll be tested on, with various weighting on the different cases. I like this quote from Norbert Wiener, which succinctly summarizes the situation. We can’t evaluate an engineering system on a single case, but only on a set of test problems.
“No apparatus for conveying information is useful unless it is designed to operate, not on a particular message, but on a set of messages, and its effectiveness is to be judged by the way in which it performs on the average on messages of this set. ‘On the average’ means that we have a way of estimating which messages are frequent and which rare…”1
Wiener was describing how to evaluate the performance of a communication system, but the same principles apply to prediction. We design prediction systems by envisioning the individual scenarios on which we’ll be tested, and the relative weight of each case. We can then minimize the prediction error on average. We can define what an error is mathematically, either by merely counting mistakes or by tabulating the error in a mean-squared sense.
With the quantified error metric in hand, we are already optimizing on day 1. We seek the predictions that minimize the average of the errors.
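To make that concrete, here is a minimal way to write the objective down (the notation is a sketch and may differ from the class notes): a loss function scores the cost of each prediction, weights encode how much each case counts, and the population error is the weighted average of the per-case losses.

```latex
% Population error as a weighted average of per-case losses.
% Notation is a sketch and may not match the class notes.
% The weights w_i are nonnegative and sum to one.
\[
R(f) = \sum_{i=1}^{N} w_i \, \ell\bigl(f(x_i), y_i\bigr),
\qquad \text{with, e.g.,} \quad
\ell(\hat{y}, y) = \mathbf{1}\{\hat{y} \neq y\}
\quad \text{or} \quad
\ell(\hat{y}, y) = (\hat{y} - y)^2 .
\]
```

The first loss merely counts mistakes; the second tabulates error in the mean-squared sense.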
Now, it seems silly to ask for the optimal predictions on a population where you know all of the outcomes. If you clairvoyantly knew that individual i had outcome y, then the best thing to do would be to predict y. But what if you had to make predictions from a fixed set of machine-readable data? Each individual would only be identified by a vector of features or attributes x. Your goal as a forecaster is to predict the outcome or label y from x alone.
Let me be explicit. We know in advance that there’s a list of N individuals. We want to make predictions about these individuals. We decide in advance that we will make predictions only using a set of attributes presented in a machine-readable format. In line with machine learning parlance, I call these the individual’s features and write them as x. Our goal is to predict a different attribute, called a label, denoted by y, from the features. Let’s work out what happens when the label is binary, only taking values 0 and 1. This case forms the backbone of machine learning.
In the case of binary labels, it’s trivial to work out the optimal prediction on the population: take all of the individuals whose feature value is equal to x and compute the rate at which the label is equal to 1. Let’s call this rate r(x). The optimal prediction is then some function of this rate. You can check the class notes for a short mathematical derivation.
If you tell me I’ll be evaluated by a Brier score, the optimal prediction for an individual with feature vector x is precisely r(x). If you tell me I’ll be evaluated on error rate, the optimal prediction is 1 if r(x) is greater than 1/2 and 0 otherwise. When being evaluated on averages, all you need to know to make a prediction is the loss function that scores errors and the rate at which the label is equal to 1 for each feature vector.
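For binary labels, the deferred derivation fits in a few lines. Here is a sketch (again, the notation may differ from the class notes): conditioned on a feature vector x, the label equals 1 at rate r(x), so the average loss of any prediction depends only on r(x).

```latex
% Sketch of the derivation for binary labels; notation may differ from the notes.
% Brier (squared) score: predicting the value p on the group with features x
% incurs average loss
\[
r(x)\,(1 - p)^2 + \bigl(1 - r(x)\bigr)\,p^2,
\]
% a convex quadratic in p whose minimizer is p = r(x).
% Error rate (0-1 loss): predicting 1 errs at rate 1 - r(x) and predicting 0
% errs at rate r(x), so the better hard prediction is
\[
\hat{y}(x) = \mathbf{1}\bigl\{ r(x) > \tfrac{1}{2} \bigr\}.
\]
```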
This already puts us in an interesting position for machine learning. We can imagine there are very few feature vectors, but the outcomes vary widely. For example, medical diagnostics often use only a few basic attributes, such as age, sex, medical history, and lab tests, to make a prognosis. In such scenarios, we would work hard to estimate base rates in our current population and assume they translate into the future.
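As a toy illustration of this base-rate counting, here is a short sketch that tabulates r(x) on a made-up population (every feature name and number below is hypothetical, not from the lecture):

```python
from collections import defaultdict

# Hypothetical toy population of (feature vector, binary outcome) pairs.
# Every feature name and value here is made up for illustration.
population = [
    (("age>60", "test+"), 1),
    (("age>60", "test+"), 1),
    (("age>60", "test-"), 0),
    (("age<=60", "test+"), 1),
    (("age<=60", "test+"), 0),
    (("age<=60", "test-"), 0),
]

# Tally, for each distinct feature vector x, how many individuals share it
# and how often their label equals 1.
counts = defaultdict(lambda: [0, 0])  # x -> [num individuals, num positives]
for x, y in population:
    counts[x][0] += 1
    counts[x][1] += y

for x, (n, positives) in counts.items():
    r = positives / n  # base rate r(x)
    # Brier-score-optimal prediction is r(x) itself; error-rate-optimal
    # prediction thresholds it at 1/2.
    print(x, f"r(x) = {r:.2f}", f"hard prediction = {int(r > 0.5)}")
```

With only a handful of distinct feature vectors, these counts are the whole story; the practical question is whether today’s base rates carry over to tomorrow’s population.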
We could also have a prediction problem where there are numerous features and essentially no repetition of feature vectors, so there is no observable variability in the label for any particular feature vector. In image classification, it would be impossible for exactly the same image to ever appear twice in an evaluation corpus. In this case, we’d like to estimate how the labels vary from image to image and come up with a computer program that memorizes the current data, but can smoothly interpolate the label between perceptually related images.
These are the two main cases we’ll look at when we turn to supervised learning. When feature vectors do not uniquely define individuals, machine learning is actuarial, counting rates to estimate likelihoods of outcomes. When feature vectors are unique, the question arises as to whether we can write a computationally efficient program that takes feature vectors as input and outputs predictions. In this latter case, when all individuals are unique, machine learning is interpolation. The task becomes understanding how to relate the labels between similar feature vectors. Most prediction problems require bringing to bear a bit of both actuarial and interpolative thinking, text being a notable example.
The first unit of this class focuses on these imaginary best-case scenarios for metaphysical populations. If we saw all the data, how well could we do? Evaluating this hypothetical gives us a goal to strive for. It also shows us something obvious but easily missed: once you know what measure you are going to target, your fate as a forecaster is sort of set. The metric, which will only be computed in the future, determines how you process the past.
1. Norbert Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press, 1949.