*This post digs into Lecture 9 of Paul Meehl’s course “Philosophical Psychology.” Technically speaking, this lecture starts at minute 82 of Lecture 8. The video for Lecture 9 is here. Here’s the full table of contents of my blogging through the class.*

It’s a bit weird that Probability 1 and Probability 2 have the same name. Probability 1 is a metalinguistic construct about relationships between evidence and conclusions. Probability 2 is an object-linguistic concept about frequencies of events. Probability 2 we can calculate using appropriate combinatorics. Despite a century of effort, we still don’t have a clean algorithm to grind out real number assignments for Probability 1. We don’t know how to compute numerical probabilities of theories from facts. Is it possible that Probability 1 and Probability 2 are just different concepts with the same name?

Probably not. There are striking connections that tell us they must be related. A second natural question is then whether one is a special case of the other. Some religious people think that Probability 1 is just a confused version of Probability 2. These people are called Frequentists. More common are the faithful who think that Probability 2 is the collection of simple, computable cases of Probability 1. These people are called Bayesians. Who’s right? Let’s look at the evidence.

We have the compelling fact that both notions lead to the definition of a fair bet. Many (most?) subjective Bayesians *define* probability by fair betting. Relative frequency also gives you the correct betting odds for games of chance. Though we can’t always compute Probability 1, when we can, it gives us the same answers as Probability 2. Since Probability 1 applies to more than frequency, perhaps this suggests that Probability 1 is the fundamental concept and Probability 2 is a special case.
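To make the fair-bet connection concrete, here's a minimal sketch (the function names and the decimal-odds convention are mine, not from the lecture): a bet priced at odds matching the event's probability has zero expected profit, whether that probability came from a subjective credence or a relative frequency.

```python
def fair_odds(p):
    """Decimal odds that make a bet on an event of probability p fair."""
    return 1.0 / p

def expected_profit(p, odds, stake=1.0):
    """Expected profit of staking `stake` at decimal `odds`
    on an event that occurs with probability p."""
    return p * (odds * stake - stake) - (1 - p) * stake

# At fair odds, the bet has zero expected profit:
p = 0.25
print(expected_profit(p, fair_odds(p)))  # 0.0
```

Both a Bayesian and a Frequentist would accept either side of a bet priced this way, which is why the two camps agree whenever the computation can actually be carried out.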

Can we make a case that Probability 2 is the fundamental concept? Meehl puts forward a compelling defense of Frequentism, or at least of the claim that Probability 1 has to track frequency. When it comes to prediction, frequency takes over.


Suppose you have a Bayesian monk who believes he can predict the future. Let’s call him Jake Gold. Gold makes probabilistic predictions about everything. He writes down all sorts of predictions about everyday life, about sports, and about politics. Everything in his experience gets assigned a probability between 0 and 1.

Since the future eventually becomes the present, we can retrospectively evaluate Gold’s predictions. Take his past track record and make a histogram. There will be a collection of events where Gold declared the probability to be between 0.7 and 0.8. We can count how often those predictions came true. We can make a similar bin for his predictions between 0 and 0.1, 0.1 and 0.2, and so on. Now, what should the accuracy be in each of these bins? If I take all of Gold’s predictions scored at 70%, those events should happen around 70% of the time, no? It seems reasonable that the event rates in each of these bins should be in the ballpark of their midpoints. If the correlation between the inductive, logical prediction algorithm and the *frequency* of being right were negligible, we’d all think the forecasts were fishy. Maybe frequency, not belief, is the basic notion behind probability after all.
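The binning procedure above is easy to sketch in code. Here is a minimal version (the track record is hypothetical, not Gold's actual one): group each prediction into a probability bin and report the empirical hit rate per bin.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, n_bins=10):
    """Group (prediction, outcome) pairs into probability bins and
    report the empirical frequency of the event in each bin."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        i = min(int(p * n_bins), n_bins - 1)   # e.g. 0.73 lands in bin [0.7, 0.8)
        bins[i].append(y)
    return {(i / n_bins, (i + 1) / n_bins): sum(ys) / len(ys)
            for i, ys in sorted(bins.items())}

# A hypothetical track record: four forecasts around 70%, three came true.
preds = [0.72, 0.71, 0.75, 0.70, 0.1, 0.1]
hits  = [1,    1,    1,    0,    0,   0]
print(calibration_table(preds, hits))
# {(0.1, 0.2): 0.0, (0.7, 0.8): 0.75}
```

A calibrated forecaster is one where each bin's empirical frequency sits near the bin's midpoint, which is exactly the check described above.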

This notion I’ve described here, a strong correlation between the frequency of predicted events and predicted probabilities, is called *calibration*. It is a *necessary* property of a good Probability 1 prediction system. And that necessity is why Bayesians can’t ditch frequency.

Now, calibration isn’t *sufficient* for a good Probability 1 system. You can have perfectly calibrated forecasts that aren’t particularly useful. Let me give an example due to Rakesh Vohra. Suppose you have a sequence of events that simply alternates between A and B.

A,B,A,B,A,B,A,B,A,B,A,B,....

The prediction goal is to provide the probability of A. An example of a calibrated forecast is

1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,...

100% of the time, when the forecast predicts a 100% certainty of A, A happens. A never happens when the forecast predicts 0% probability. However, another perfectly calibrated forecast is

0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...

Since A and B occur with equal frequency, the forecast is right 50% of the time. Since it only predicts 50%, it is perfectly calibrated.1
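Both forecasts can be checked mechanically. Here's a small sketch (the scoring function is my own framing, not from the lecture): for each distinct forecast value, compare it to the empirical frequency of A on the rounds where that value was issued.

```python
def calibration_error(forecasts, outcomes):
    """Max gap, over distinct forecast values, between the forecast
    and the empirical frequency of the event when it was issued."""
    gaps = []
    for q in set(forecasts):
        hits = [y for p, y in zip(forecasts, outcomes) if p == q]
        gaps.append(abs(q - sum(hits) / len(hits)))
    return max(gaps)

outcomes = [1, 0] * 50            # A=1, B=0, alternating
sharp = [1.0, 0.0] * 50           # the clairvoyant forecast
lazy  = [0.5] * 100               # the constant forecast
print(calibration_error(sharp, outcomes))  # 0.0
print(calibration_error(lazy, outcomes))   # 0.0
```

Both forecasts score a calibration error of exactly zero, even though only the first one is worth anything.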

While calibration is necessary for a Probability 1 prediction system, it is only part of the story. Having a calibrated system doesn’t tell us that your probabilities are informative or useful (even if Jake Gold spends a lot of time bragging about how impressive his system’s calibration is). This is fine. Predicting fair betting odds is insufficient for a Probability 2 system too. The need for calibration merely confirms that frequency is fundamental to probability. With respect to the future, Bayesian predictions need to align with long-run frequencies. Something about frequencies in the past clues us in to what to expect in the future. And this leads us to Meehl’s penultimate lecture: no matter how confusing we find probability, we can’t deny the astonishing power of statistical prediction.

I bring up Vohra because he and Dean Foster proved something even more surprising about calibration. Though the “always predict 50%” example was contrived, it turns out to generalize. Building a calibrated prediction system is trivial, and you can do it without knowing anything about the events you are predicting. You can make predictions *solely* to enforce calibration. Foster gives a simple proof of this fact in a later paper. If some bins are calibrated, meaning their probability forecasts match the historical frequencies, predict from one of those bins. If not, sample among the uncalibrated bins, weighted toward the choice most likely to restore calibration. Foster’s proof uses a powerful technique called Blackwell Approachability. More or less, he shows that if you define “calibration” to be your loss function, then an online gradient-like algorithm can find a calibrated forecast with no knowledge about what it should be predicting.
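To give a feel for the recipe, here is a toy online forecaster (my own heuristic paraphrase; Foster's actual construction, and the guarantee that comes with it, are more careful than this sketch): track each bin's empirical frequency, forecast from a bin that is currently calibrated when one exists, and otherwise randomize with weights favoring the bins closest to calibration.

```python
import random

def calibrated_forecaster(outcomes, n_bins=10, seed=0):
    """Heuristic sketch of a forecaster that only tries to be calibrated.
    Each bin i forecasts its midpoint (i + 0.5) / n_bins."""
    rng = random.Random(seed)
    counts = [0] * n_bins   # times bin i was forecast
    hits = [0] * n_bins     # event occurrences when bin i was forecast
    forecasts = []
    mids = [(i + 0.5) / n_bins for i in range(n_bins)]
    for y in outcomes:
        # Gap between each bin's midpoint and its empirical frequency;
        # an empty bin counts as perfectly calibrated.
        gaps = [abs(mids[i] - hits[i] / counts[i]) if counts[i] else 0.0
                for i in range(n_bins)]
        ok = [i for i in range(n_bins) if gaps[i] <= 1.0 / n_bins]
        if ok:
            i = rng.choice(ok)  # some bin is already calibrated: use it
        else:
            # Otherwise favor the bins closest to calibration (all gaps > 0 here).
            i = rng.choices(range(n_bins), weights=[1.0 / g for g in gaps])[0]
        forecasts.append(mids[i])
        counts[i] += 1
        hits[i] += y
    return forecasts
```

Note that the forecaster never looks at the upcoming outcome, only at its own bookkeeping; that is the unsettling part of the Foster–Vohra result.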

You know exactly what I’d say: Probability 1 and Probability 2 are both called “probability” because they obey Kolmogorov’s axioms. But the axioms won’t tell you how to bridge these concepts; that’s a pragmatic, problem-dependent affair.

Rest in p's.