In the last post, I described an interpretation of scoring rules in terms of utility maximization, arguing that risk-averse stochastic optimization demanded accurate forecasts of the future. In this decision-theoretic framework, forecast errors were measured by the specifics of the decision problem, and forecasters needed to tune their predictions to the particular formulation of utility.
However, some aspects of forecasting should be independent of how the forecast will be used. Are there parts of forecasting that are essential no matter what the task at hand is? Calibration is one such potentially universal component.
Calibration is not asking for much. A forecast is calibrated if, whenever the forecaster says something should happen P percent of the time, it happens P percent of the time. The weather service wants to ensure that when they predict a 50% chance of rain, it rains 50% of the time. This seems like a very reasonable property of a forecast. Honestly, it would be odd if some prognosticator got such rates wrong. If you told me something happened with probability 0.05, but I did my back-of-the-envelope calculation and saw that this thing happened half of the time, I'd stop asking you for predictions.
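Checking calibration is just bookkeeping. Here is a tiny Python sketch (the bin width, the variable names, and the toy data are my own choices) that groups a batch of probability forecasts into bins and compares each bin's stated probability to the empirical rate of the event:

```python
import numpy as np

def calibration_table(forecasts, outcomes, n_bins=10):
    """For each probability bin, compare the average stated forecast
    to the empirical frequency of the event."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Assign each forecast to a bin of width 1/n_bins (a forecast of 1.0 goes in the top bin).
    idx = np.minimum((forecasts * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            print(f"bin {b / n_bins:.1f}-{(b + 1) / n_bins:.1f}: "
                  f"avg forecast {forecasts[mask].mean():.2f}, "
                  f"observed rate {outcomes[mask].mean():.2f}, n={mask.sum()}")

# Toy check: here nature really does follow the stated probabilities,
# so the two columns should agree up to sampling noise.
rng = np.random.default_rng(1)
p = rng.choice([0.1, 0.5, 0.9], size=5000)
rain = rng.random(5000) < p
calibration_table(p, rain)
```

When the observed column tracks the forecast column, the forecaster is calibrated. When you predict 0.05 and the observed column reads 0.50, that's when you stop asking.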
So how can you make well-calibrated predictions? You might think that a good Bayesian will always be calibrated. Bayes' rule is all you need, after all. Why would you need to correct a perfect epistemic framework? A classic argument by Phil Dawid shows this simply isn't the case. Suppose you are a Bayesian forecaster. Given everything you've seen before, you decide the probability of rain today is 10%. Then the events "it rains" and "I am going to check your accuracy at 10%" are conditionally independent, given everything observed by the Bayesian thus far. That is, before we see if it rains today, there's no reason to think that checking the calibration of a prediction has any impact on the prediction. But this means that your empirical estimate of your calibration is always unbiased under your own beliefs. If you keep your subjective probabilities updated using Bayes' rule, you will never think you are miscalibrated. However, if I were a malicious god, I could ask for your prediction of rain and decide to have the weather be whatever will make you miscalibrated. A subjective Bayesian will always believe they are calibrated, while an adversarial nature can ensure they are miscalibrated.
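To spell out the step doing the work, in my notation (not Dawid's): let $p_t$ be your forecast of rain, $Y_t$ the outcome, $C_t$ the event that your calibration gets audited at level $p_t$, and $\mathcal{F}_{t-1}$ everything you've seen so far. Since $p_t$ is your own conditional probability of rain and the audit is conditionally independent of the outcome,

$$
\mathbb{E}\big[\,\mathbf{1}\{C_t\}\,(\mathbf{1}\{Y_t=1\} - p_t)\,\big|\,\mathcal{F}_{t-1}\big]
= \mathbb{P}(C_t \mid \mathcal{F}_{t-1})\,\big(\mathbb{P}(Y_t=1 \mid \mathcal{F}_{t-1}) - p_t\big) = 0 .
$$

So the running calibration error on the audited rounds is a mean-zero martingale under your own beliefs.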
Probability is weird, folks. There’s a paradox lurking around every corner.
On the other hand, it's very easy to cook up an algorithm that produces calibrated forecasts with absolutely no knowledge of what it's supposed to be forecasting. Dean Foster and Rakesh Vohra originally demonstrated this back in 1998. More recently, in a 2021 paper with Sergiu Hart, Dean came up with a super simple method for generating calibrated forecasts. You look at your frequency bins and find one where the event has happened at least as often as you predicted (e.g., suppose that when you have predicted 10%, the event has happened 15% of the time thus far) and whose neighbor has the opposite problem, with the event happening no more often than predicted. You then choose your forecast randomly between these two adjacent bins. Foster and Hart give a closed-form formula for the probability of picking each one. That's it. Effectively, you look for the place where your observed rates cross from above your predictions to below them and randomize between the two bins so that, on average, the rates get nudged back toward the predictions. Since you randomly choose between two forecasts, an adversarial god can't know which bin you'll pick and, hence, can't muck up your calibration. This is the joy of mixed strategies.[1]
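Here is a minimal Python sketch of the scheme as I've paraphrased it. The grid size, the bookkeeping, the particular mixing formula, and the toy deterministic "nature" are my choices, not a transcription of Foster and Hart's paper:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 20                        # grid resolution: forecasts live on {0, 1/m, ..., 1}
bins = np.arange(m + 1) / m
counts = np.zeros(m + 1)      # how many times each grid value has been forecast
hits = np.zeros(m + 1)        # how many times the event occurred when that value was forecast

def next_forecast():
    """Randomize between two adjacent bins whose calibration gaps have opposite signs."""
    gaps = hits - counts * bins   # unnormalized calibration gap for each bin
    # gaps[0] >= 0 and gaps[m] <= 0, so some pair of neighbors always straddles zero.
    i = next(j for j in range(m) if gaps[j] >= 0 >= gaps[j + 1])
    a, b = gaps[i], -gaps[i + 1]  # both nonnegative
    # One natural mixing rule: weight each bin by the size of its neighbor's gap.
    # (Foster and Hart derive their own formula; this is my stand-in.)
    p = 0.5 if a + b == 0 else b / (a + b)
    return i if rng.random() < p else i + 1

# "Nature" is an arbitrary fixed pattern (rains on 40% of days) that never
# gets to see the forecaster's coin flips.
for t in range(20_000):
    outcome = 1 if (3 * t) % 10 < 4 else 0
    i = next_forecast()
    counts[i] += 1
    hits[i] += outcome

# Heavily used bins should show empirical rates within about one grid step of the forecast.
used = counts > 50
for c, n, h in zip(bins[used], counts[used], hits[used]):
    print(f"forecast {c:.2f}: used {int(n)} times, empirical rate {h / n:.2f}")
```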
Though the algorithm here uses randomness, it gives you calibrated answers even if nature is completely deterministic. It’s just computing rates, and these rates will eventually look as good as if you had seen the entire course of history and the future and calculated the rates from that data. Probability doesn’t need to exist to compute calibrated forecasts. You’ve probably forgotten already, but I was going on about this earlier in the semester. As long as your estimates don’t affect the future, you can compute accurate rates of arbitrary sequences. This fact is tremendously underappreciated.
So it’s interesting: a subjective Bayesian always believes they are calibrated. An adversarial agent can drive them to not be calibrated. And yet, a randomized algorithm can produce a calibrated forecast without any knowledge of what’s happening. Very odd!
This is why you shouldn’t put too much faith in forecasters who brag about their calibration. They should be calibrated! It’s a minimal ask for them to be calibrated. But calibration doesn’t mean they have any particularly clever or novel insights.
If calibration is so easy to achieve, why bother with it? Foster and Hart turn this around: if calibration is so easy to achieve, every forecaster should calibrate, because calibration makes your forecasts legible to everyone else. Calibration imbues your probabilities with a kind of meaning. People can appreciate the connection between your predicted rates and your subjective beliefs.
If you are a pragmatic statistician, you would like to use the language of probability to communicate uncertainty, but you want your language to be useful to others. Even without probabilistic arguments, if frequencies in the past are the same as frequencies in the future, then just getting the frequencies right is often enough to solve stochastic optimization problems. A calibrated forecast ensures the team making downstream decisions can understand what your probabilities mean. Calibration is not “all you need,” but it’s a helpful way to communicate what you mean by probability.
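Here's a toy illustration of that legibility, with my own made-up numbers: a downstream team carries an umbrella whenever the stated chance of rain exceeds the carrying cost. If the forecasts are calibrated, the loss they actually incur matches the loss the stated probabilities promise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic calibrated forecasts: nature follows the stated probabilities.
p = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=50_000)
rain = (rng.random(p.size) < p).astype(float)

c = 0.3                 # made-up cost of carrying an umbrella; getting soaked costs 1
carry = p >= c          # simple threshold rule built directly from the forecasts

realized_loss = np.where(carry, c, rain).mean()  # what the decision-maker actually pays
implied_loss = np.where(carry, c, p).mean()      # what the stated probabilities promise
print(f"realized average loss:         {realized_loss:.3f}")
print(f"loss implied by the forecasts: {implied_loss:.3f}")
```

If the forecasts were miscalibrated in the bins below the threshold, those two numbers would drift apart, and the downstream team could no longer take the stated probabilities at face value.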
[1] IMHO, a truly omniscient adversary would know your random seed, but whatever, it's a math model.
The problem with trying to get a well-calibrated forecast is that the edits you make to improve its calibration (even using a strategy like Foster and Hart's) might destroy the forecast's utility. It's trivial to get a well-calibrated day-ahead rain forecast for Seattle -- always predict 40%. The trick is to get a calibrated forecast that is also "sharp".
In Bayesian decision theory, you are playing a game with (non-adversarial) Nature, and there is normally a unique optimal calibration. If there's an adversary, you need game theory and a Bayes-Nash equilibrium, which need not be unique.