The prototypical statistical prediction problem is bit prediction: Having seen a long list of 0s and 1s, your task is to predict whether the next bit equals zero or one. For example, how might I predict whether Steph Curry will make his next free throw? I might look at all the times Steph attempted free throws this season. From this, I could compute his free-throw percentage (94.0%). I can then use this percentage to inform my prediction (I’m guessing he’s going to make it).
Why did I use Steph’s free throws from this season? I could have used all of the free throws in his NBA career, which would have changed the free-throw percentage to 91.1%. This might dampen my confidence a bit, but only a little. I probably wouldn’t have used all of the free throws taken by all players in the NBA this season (their success rate is only 78%). Nor would I look only at Steph’s free-throw percentage from yesterday. Even for an all-time great shooter, shooting percentage varies from day to day, and the number of attempts in any individual game feels like too few to get a reliable handle on his abilities.
This example may seem cutesy, but it’s more or less what we do in forecasting and machine learning. We gather an appropriate reference class of all events related to the one we want to predict. Once we have a reference class, our task just becomes predicting a bit: the bit is one if the event happens, zero if not. To predict the bit, we compute the fraction of the time that the event happened in the reference class. We then use this fraction to inform our prediction, leaning towards predicting it happening when the fraction is close to 1.
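Here's what that recipe looks like as a minimal Python sketch (my illustration, not code from any forecasting library; the toy data is made up):

```python
# Reference-class prediction: the forecast for the next bit is the
# empirical frequency of 1s among the past events we chose to count.

def predict_next_bit(reference_class: list[int]) -> float:
    """Fraction of events in the reference class that happened."""
    return sum(reference_class) / len(reference_class)

# Toy record: 1 = made free throw, 0 = miss. With a full season of
# attempts, this fraction would come out to Steph's 0.94.
recent_attempts = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(predict_next_bit(recent_attempts))  # 0.9
```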
Change my frivolous sports example to predicting recidivism based on a psychometric test or sepsis based on an EHR readout. The story remains the same. We reduce complex classification problems to predicting bits.
So how do we evaluate methods for bit prediction? Why do we use the method of averaging in reference classes? Let’s look at the goal of evaluation I described in Tuesday’s post.
Evaluation measures the average difference between our articulated expectations of a system and its actual performance.
For bit prediction, I’ll let the distance between expectations and performance be 0 if we get the answer right and 1 if we get the answer wrong. In math, I’m going to score in terms of mean-squared error.1
The score equals zero when the prediction matches reality and one when it misses.
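In symbols (my notation, not from the original post): if $\hat{y}_t$ is the prediction at round $t$ and $y_t \in \{0,1\}$ is what actually happened, the evaluation over $T$ rounds is

$$\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2,$$

and when the predictions themselves are bits, each term is exactly 0 on a hit and 1 on a miss.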
Let me make our lives a bit easier today and say that we don’t have to predict a bit, but instead a number between zero and one. This way, if we predict 0.94 and the event happens, our score is 0.004, which is much smaller than the 0.25 we’d get by predicting 0.5. This real-valued prediction quantifies our confidence in the outcome. We could say we’re predicting the probability of the event happening. This is nicer than having to predict an actual bit because we can claim to never be wrong. As long as we predict something strictly greater than 0 and strictly less than 1, we are saying there’s a chance of both outcomes. Hedging with probabilities is key to making money as a forecaster.
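To make the hedging arithmetic concrete, here's a quick illustration (same scoring rule as above):

```python
# Squared-error score for a single hedged prediction of a single bit.

def squared_error(prediction: float, outcome: int) -> float:
    return (outcome - prediction) ** 2

for p in [0.94, 0.5, 1.0]:
    print(f"predict {p:0.2f}: "
          f"{squared_error(p, 1):0.4f} if it happens, "
          f"{squared_error(p, 0):0.4f} if it doesn't")

# predict 0.94: 0.0036 if it happens, 0.8836 if it doesn't
# predict 0.50: 0.2500 if it happens, 0.2500 if it doesn't
# predict 1.00: 0.0000 if it happens, 1.0000 if it doesn't
```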
Now that I’ve set rules of evaluation, what should you predict? You have to make some guess about how all of the bits you’ve seen before relate to the ones you’ll be tested on. If you articulate how you will be evaluated and model your expectations of future events, you can optimize your model and hope it reflects reality.
What’s a good model for a stream of bits? There are a few models that people have decided to run with as standard, and I will highlight the three I most frequently encounter.
Following a legacy started by Ronald Fisher and pushed heavily in engineering applications by Wiener and Shannon, the first two models assume that all data is random. In this case, personal expectations become mathematical, probabilistic expectations. There are two popular random models of bits:2
i.i.d. That is, independent and identically distributed. In this model, every outcome at every time is equivalent to flipping a biased coin. Each coin flip is completely uninfluenced by every other coin flip. And the mechanism that generates the coin flip is the same each time. Data is generated by repeating the identical mechanism over and over and over again.
Exchangeable. i.i.d. is a pretty strong assumption! Exchangeable is the model that makes Bayesians feel better about their methodologies. In the exchangeable model, there is a random process that generates a sequence of bits, and the probability of any sequence is equal to the probability of any reordering of that sequence. That is, the probability of seeing (A, B, C) is the same as the probability of seeing (C, B, A) or (B, A, C). Exchangeability asserts that the order of observations doesn’t matter.
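De Finetti's classic recipe makes the difference between these two random models concrete: mixing over a random bias yields a sequence that is exchangeable but not i.i.d. Here's a minimal simulation sketch (the uniform prior on the bias is my arbitrary choice):

```python
import random

def iid_bits(n: int, p: float = 0.94) -> list[int]:
    """i.i.d. model: independent flips of one fixed biased coin."""
    return [int(random.random() < p) for _ in range(n)]

def exchangeable_bits(n: int) -> list[int]:
    """Exchangeable-but-not-i.i.d. model: draw the coin's bias once at
    random, then flip that coin n times. Every reordering of the output
    is equally likely, but the bits are correlated through the shared,
    unknown bias: early flips carry information about later ones."""
    p = random.random()  # the hidden bias, drawn a single time
    return [int(random.random() < p) for _ in range(n)]

print(sum(iid_bits(1000)) / 1000)           # near 0.94 on every run
print(sum(exchangeable_bits(1000)) / 1000)  # wanders from run to run
```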
Let me add a nonrandom model to the pile:
Arbitrary sequence. Assume our observations are a sequence of 0s and 1s fatefully determined before we start measuring them. Our predictions do not influence their outcome. The bits are already set in stone before we start predicting. For people who work in causal inference, this is similar to the assumptions in the Neyman-Rubin model of potential outcomes. Our interactions don’t change the data. They simply reveal what we haven’t seen. In prediction problems, the data that is missing is the future.
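The arbitrary-sequence model is just as easy to sketch: fix the bits in advance, then reveal them one at a time to a predictor. Here the predictor is the running average of what's been seen, with a default first guess of 0.5 (my arbitrary choice):

```python
# Arbitrary-sequence model: the bits exist before we predict anything.
fixed_sequence = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # set in stone

seen: list[int] = []
total_loss = 0.0
for bit in fixed_sequence:
    # Predict the running average of the bits revealed so far.
    prediction = sum(seen) / len(seen) if seen else 0.5
    total_loss += (bit - prediction) ** 2
    seen.append(bit)  # flipping the card reveals the data; it never changes it

print(total_loss / len(fixed_sequence))  # average squared error
```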
These three models give us very similar prescriptions for bit prediction. They all tell us that predicting the rate in the reference class is about as good as you can do. The i.i.d. model gives the easiest calculations, but all three models for bit streams lead us to the same prediction procedures.
These models converge on the same answer because all three transform prediction problems into missing data problems. They assume that our sequence of bits has been chosen by fate, both in the past and the future. It’s as if we have an infinite deck of cards laid out on a table in front of us. We flip the first few cards and note the colors red and black. Our goal is to infer the colors of the following cards in the sequence. In these models, we are essentially assuming the future has already happened. Since we’ve decided to score our predictions using averages, it’s not too far a deduction to argue that we might as well use the average of what we’ve seen to predict the average of what we haven’t seen. Missing data models implicitly assume this is the best we can do.
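The deduction is one line of calculus (standard, not specific to this post). If we commit to a single constant prediction $p$ for the bits $y_1, \dots, y_T$, the squared loss is minimized where its derivative vanishes:

$$\frac{d}{dp}\sum_{t=1}^{T}(y_t - p)^2 = -2\sum_{t=1}^{T}(y_t - p) = 0 \quad\Longrightarrow\quad p = \frac{1}{T}\sum_{t=1}^{T} y_t.$$

The best constant prediction under average squared error is exactly the fraction of 1s: the rate in the reference class.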
1. I will forever defend the squared loss. We get nice formulas, algorithms, and intuitions, and it works about as well as any other metric in machine learning. We can fight about this later if you want.

2. If you’re into that sort of thing, you’ll find a cornucopia of less popular models in the proceedings of COLT, FOCS, and STOC. There will be no filtrations on this blog.
Some of this discussion reminded me of “Uncertainty in the Hot Hand Fallacy: Detecting Streaky Alternatives to Random Bernoulli Sequences,” which tests the i.i.d. assumption on sequence data. The interesting aspect of that paper is the “streaky alternative” model the authors consider.