If you’re a long-time reader, you know I am not a fan of casually quantified probability. However, today, in the name of mathematical pluralism, I want to steelman probabilistic forecasting. In particular, I’m hoping this might shed some light on how probabilities are evaluated in the context of forecasting.
Obviously, accurate forecasts are useful. If I want to make plans for tomorrow, it would be helpful to know in advance what will happen. But, of course, I can never know for sure. Prognosticators must couch their forecasts in uncertainty. The lingua franca of uncertain predictions has become, for better or worse, mathematical probability.
Forecasters might tell me there’s a 50% chance of rain or a 90% chance of Tesla stock closing lower today. How do I know whose quantified probabilities are right? If I have a bunch of forecasters, how do I choose between them? If I apply my mantra that a system “works if it doesn’t yet not work,” then I would like to choose amongst my forecasts the one that has worked the best so far. This means I need some principled way to compare forecasts.
In order to make sense of what people in information science do today, I have found it’s always best to go look in the literature of the early Cold War. Lo and behold, in 1950, Glenn Brier invented a straightforward score based on squared errors. To give a simple form, suppose you are asked to predict K events with binary outcomes. Let q_k denote your forecast probability for the kth event and let y_k equal 1 if the event happened and 0 otherwise.
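The score is then the negative mean squared error between your forecasts and the outcomes (at least in the form consistent with the numbers below):

$$ \text{Score} \;=\; -\frac{1}{K} \sum_{k=1}^{K} \left(q_k - y_k\right)^2 $$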
If you predict a 50% chance for all events, the score is -0.25. If you guess all events correctly with 100% certitude, the score is 0. If you guess all of the events incorrectly with the same certitude, the score is -1.
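Here’s a quick numerical check of those three cases as a toy Python sketch (the outcomes are made up for illustration):

```python
import numpy as np

def score(q, y):
    """Negative mean squared error between forecast probabilities q and binary outcomes y."""
    q = np.asarray(q, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.mean((q - y) ** 2)

y = np.array([1, 0, 1, 0])               # four binary outcomes (made up)
print(score([0.5, 0.5, 0.5, 0.5], y))    # -0.25: hedging at 50% on everything
print(score(y, y))                       # 0.0: certain and correct on every event
print(score(1 - y, y))                   # -1.0: certain and wrong on every event
```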
Brier designed his scoring rule to prevent what he calls “forecast hedging.” Based on his experience in weather forecasting, he noted that the particulars of the scoring method “may lead the forecaster to forecast something other than what he thinks will occur, for it is often easier to analyze the effect of different possible forecasts on the verification score than it is to analyze the weather situation.”
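To see why hedging doesn’t pay under a squared-error score, suppose you think an event happens with probability p and you announce the forecast q. Under your own belief, your expected score on that event is

$$ \mathbb{E}\left[ -(q - y)^2 \right] \;=\; -(q - p)^2 \;-\; p(1 - p), $$

which is maximized exactly at q = p. In expectation, you can only lose by forecasting something other than what you believe.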
In 1971, Leonard Savage built on the work of Brier and others to formalize a general family of scoring rules. Savage considered scoring rules as a way to get forecasters to reveal their “true beliefs.” Like a good Bayesian, Savage wanted a rigorous system to measure people’s internal subjective probabilities. The knee-jerk Bayesian approach to measuring subjective probability is to elicit bets. By badgering you for long enough, I can get you to nail down how much you believe in any logical statement, and this corresponds to your personal probability.
But badgering is annoying. Is there another way? Savage wrote:
“This article is about a class of devices by means of which an idealized homo economicus-and therefore, with some approximation, a real person-can be induced to reveal his opinions as expressed by the probabilities that he associates with events, more generally, his personal expectations of random quantities.”
There’s a lot in Savage’s paper, but a salient bit I liked was about utility maximization. Let’s suppose you have a utility maximizing agent. Look, despite what the AI Alignment bros tell you, we all know these don’t exist. Nonetheless, people are attached to this silly, simple model, and I think it helps clarify what’s happening with forecast scores. I’m embracing mathematical pluralism no matter how much it pains me!
A utility maximizing agent has a personal model of the world and some preferences. They choose an action so that, under their model of the world, their expected utility is maximized. Let Q denote a probabilistic model of the outcomes Y. Let A denote an action taken. For each pair of outcome and action, there is an associated reward R(Y,A). Then we can try to maximize the expected utility.
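In symbols, one way to write the agent’s problem, with the expectation taken over outcomes Y drawn from the model Q:

$$ \max_{A} \; \mathbb{E}_{Y \sim Q}\left[ R(Y, A) \right] $$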
Q here is their personal probability function over events. Such utility maximization gives us a clean way to score a probabilistic model. Let π_Q be the policy that maximizes utility under the model Q.
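In the notation above, that means

$$ \pi_Q \;\in\; \arg\max_{A} \; \mathbb{E}_{Y \sim Q}\left[ R(Y, A) \right]. $$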
Suppose the actual world is also probabilistic and outcomes Y are drawn from a true distribution P. Then we can define a score of our forecast in terms of its expected utility.
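One natural way to write this score is the expected reward earned by the policy π_Q when outcomes actually follow P:

$$ \text{Score}(Q) \;=\; \mathbb{E}_{Y \sim P}\left[ R(Y, \pi_Q) \right] $$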
That is, this score measures the expected utility for using the rule defined by Q when the world obeys the laws of P. Clearly, the best rule is to design a policy using forecasts from the true distribution, P. But this also tells us that if we believe our probabilistic model of reality, we should treat it as true when making predictions to maximize utility.1
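Spelling that claim out: since π_P is, by construction, the reward-maximizing choice when outcomes follow P, no forecast can score strictly higher than the true distribution:

$$ \text{Score}(Q) \;=\; \mathbb{E}_{Y \sim P}\left[ R(Y, \pi_Q) \right] \;\leq\; \max_{A}\, \mathbb{E}_{Y \sim P}\left[ R(Y, A) \right] \;=\; \mathbb{E}_{Y \sim P}\left[ R(Y, \pi_P) \right] \;=\; \text{Score}(P). $$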
If you’re an “incentives” or “mechanism design” sort of person, you could think that by creating the right utility function, your scoring rule becomes ungameable. Deviating from what you believe the world to be becomes “irrational” or whatever (remember, irrational means something very particular for fans of the homo economicus theory of everything). If you want to maximize EV like Sam Bankman-Fried and you believe nature is probabilistic, then you’re best served to find a model that aligns closely with what the world will do.
The only issue with these utility-maximizing rules is that many forecasts give the same score. In the next post, I’ll describe ways to make these utility-maximizing scores strictly proper so that P has a strictly higher score than any other probabilistic model. And I may touch on some of the fun mathematics of Bregman divergences we encounter along the way.
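To make the degeneracy concrete, here’s a toy Python sketch (the reward, the threshold policy, and the numbers are all made up for illustration). With a binary outcome and a reward of 1 for matching it, every model that puts more than half of its probability on the right answer induces the same policy and earns exactly the same score:

```python
# Toy example: many different models Q earn the same utility-based score.
p_true = 0.7                          # assumed true probability that Y = 1

def reward(y, a):
    """Reward of 1 for choosing the action that matches the outcome."""
    return 1.0 if a == y else 0.0

def policy(q):
    """Action maximizing expected reward when we believe Q(Y = 1) = q."""
    return 1 if q >= 0.5 else 0

def utility_score(q, p=p_true):
    """Expected reward of policy(q) when Y ~ Bernoulli(p)."""
    a = policy(q)
    return p * reward(1, a) + (1 - p) * reward(0, a)

for q in [0.51, 0.7, 0.99]:
    print(q, utility_score(q))        # all three print 0.7: indistinguishable by this score
```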
For my control theory friends, this is yet another argument for certainty equivalence.