16 Comments
Jessica Hullman

This paper has a nice explication of how different strictly proper scoring rules imply different expectations about the kinds of decisions one will face:

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/1CE00C0166746CF143388CCDF7926A3A/S0031824800009910a.pdf/pragmatists_guide_to_epistemic_utility.pdf

Ben Recht

Oh nice. Thanks!

Roman W 🇵🇱🇺🇦

"Suppose you have a model predicting that the event happens with probability Q. How much of your pot should you bet? You could just maximize the expected value of your return. "

There is more than one subtlety here, as those of us who learned or practiced quantitative finance know :)

Subtlety 1: the time value of money. If the payout happens in a year's time, getting paid 1 dollar then is worth less to me now, because interest rates are non-zero. If the substance of your bet is not connected to interest rates, this is irrelevant, but it becomes relevant if, for example, you're betting on the Fed not raising interest rates to 20%.

Subtlety 2 (related to Subtlety 1): Different payoffs related to a bet on a world-changing event happen in different states of the world. Take an extreme example: a bet on a world-ending nuclear war (WENW) happening within 1 year. If I bet on "there will be such a war" and win the bet, the payoff - if I live to receive it - will be worth 0. Hence, it doesn't make sense to bet on the nuclear war happening. Only betting against it can have a positive expected value. You can still work out some information about my estimate of the probability of WENW by proposing the odds B and asking me if I want to bet or not. But in the general case, where the value of the payoff to me conditioned on the resolution of the bet is unknown to you (as it often is), you'll find it hard to work out what my Q is.

This is known in financial literature as the difference between risk-neutral and real-world probabilities.
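To make Subtlety 1 concrete, here is a minimal sketch (the rate, numbers, and function names are purely illustrative) of how discounting the payout can flip the sign of a bet's expected value:

```python
# Toy illustration of Subtlety 1: discount a payout received in a year at a
# flat annual rate r. All numbers and names here are illustrative assumptions.

def present_value(payout: float, r: float, t_years: float = 1.0) -> float:
    """Discount a payout received t_years from now back to today."""
    return payout / (1.0 + r) ** t_years

def expected_pv(q: float, payout: float, stake: float, r: float) -> float:
    """Pay `stake` today; receive `payout` in a year with probability q."""
    return q * present_value(payout, r) - stake

# Undiscounted, the bet below looks like free money; at 5% rates it is not.
print(0.5 * 1.0 - 0.48)                   # +0.02
print(expected_pv(0.5, 1.0, 0.48, 0.05))  # about -0.004
```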

Ben Recht

Yes, no doubt. This is why I used sports betting as an example. Gambling is the one place where these nuances are least subtle. But your point is well taken: the arguments for subjective probabilities break down pretty quickly as you move away from the casino.

David Rothman

Just a couple of (perhaps obvious) points from a practitioner of both gambling and quant finance:

1. I never cared for Ramsey's quote. Every gambler knows that you always let the other guy propose the line. You never want to make the line (unless you know for sure which side they're betting :-)

2. Many practitioners opt for half-Kelly or, more generally, lambda-Kelly (lambda < 1), which underscores the importance of risk control in decision making under uncertainty. While full Kelly is optimal in a decision-theoretic context, real-world constraints generally make a more conservative approach appealing, especially when dealing with things like haircuts, drawdowns, and bonuses. Bonuses are typically based on risk-adjusted performance, not just raw returns. Lambda-Kelly leads to less volatile returns and aligns better with the firm's expectations.
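For concreteness, a minimal sketch of lambda-Kelly sizing for a binary bet at net odds b (win b per unit staked); the function names and the flooring at zero are illustrative choices, not anyone's production sizing rule:

```python
# Fractional ("lambda") Kelly sizing for a binary bet at net odds b.
# Names and conventions here are illustrative assumptions.

def kelly_fraction(q: float, b: float) -> float:
    """Full-Kelly fraction of the bankroll: (q * (b + 1) - 1) / b, floored at 0."""
    return max((q * (b + 1.0) - 1.0) / b, 0.0)

def lambda_kelly(q: float, b: float, lam: float = 0.5) -> float:
    """Scale the full-Kelly bet down by lam < 1 (half-Kelly when lam = 0.5)."""
    return lam * kelly_fraction(q, b)

# Example: the model says q = 0.55 at even odds (b = 1).
# Full Kelly stakes ~10% of the pot; half Kelly stakes ~5%, giving up some
# growth rate in exchange for much smaller drawdowns.
print(kelly_fraction(0.55, 1.0))  # ~0.10
print(lambda_kelly(0.55, 1.0))    # ~0.05
```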

rif a saurous

Nice post.

I'd like to flip this around, because I think arguments like this show the futility and silliness of a lot of things people talk about and try to do. I've more than a few times ("often" might be too strong) seen technical people get very excited about trying to provide "better calibrated" forecasts to decision makers, when the decision makers have no actual use for them. If you're making a single decision (not accumulating utility over a sequence of decisions) and you can't size your bet (because it's a decision, not a bet), then the only thing you want to know is whether P/(1-P) > 1/B (or equivalently, whether P > 1/(B+1)), and you shouldn't be willing to pay for any more accuracy than that.
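A minimal sketch of that threshold, assuming B is the net odds (win B per unit staked); the function name is mine:

```python
# For a single, fixed-size bet at net odds B, the expected profit is
# P * B - (1 - P), which is positive exactly when P/(1-P) > 1/B,
# i.e. when P > 1/(B+1). The function name is illustrative.

def should_bet(p: float, b: float) -> bool:
    """Bet iff the forecast probability p clears the break-even threshold."""
    return p > 1.0 / (b + 1.0)

# At 4-to-1 odds (b = 4) the threshold is 0.2; whether p is 0.21 or 0.45
# changes nothing about this yes/no decision.
print(should_bet(0.21, 4.0))  # True
print(should_bet(0.19, 4.0))  # False
```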

Ben Recht

Agreed. Though I give a mild justification for calibration in the next post...

rif a saurous

Yes, although if I understand correctly, the algorithms that show it's easy to produce calibrated forecasts require a long stream of predictions about the same distribution (whatever "the same distribution" is)? Arguably, the very statement of "calibration" already requires this?

Eugène Berta

Thanks for the insightful post; this utility maximisation framework is very informative indeed.

I am more familiar with the "classical" school of thought:

"They advocate for calibration and appropriate “distributional sharpness.” I’ve tried but never understood why exactly these properties are the gold standard of what we should strive for."

While I agree that the ultimate goal is to maximise the proper score of interest, I would still argue that calibration and refinement are useful metrics and that they actually help achieve that overarching goal.

First, when the calibration error is high, post-hoc calibration directly reduces the proper loss of your prediction algorithm.

Tackling refinement error separately also helps:

In https://arxiv.org/pdf/2501.19195 we introduce a variational re-formulation of the calibration-refinement decomposition of proper scoring rules (Theorem 2.1, details in appendix B) that makes clearer what calibration, refinement and sharpness are and what they reveal about predictions.

Refinement error quantifies how much information about the outcome Y the forecasts contain, while calibration error evaluates how sub-optimal predictions are given the information they carry (or how much you can decrease the population risk simply by re-labelling predictions with a measurable function).

We show in the paper that maximising a proper score does not guarantee calibrated forecasts or optimal refinement error, since refinement and calibration errors are not minimised simultaneously in general. A well-known example is that neural networks trained by minimising cross-entropy (a proper loss function, as you showed) end up poorly calibrated.

We propose circumventing this issue by minimising refinement and calibration errors in two separate steps, and show this results in better predictions, as measured by the proper score!
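For concreteness, here is a toy sketch of the "fit for refinement, then recalibrate post hoc" recipe; the dataset, model, and Platt-style recalibration below are illustrative choices, not the exact procedure from the paper:

```python
# Step 1: fit a base model (often over-confident out of the box).
# Step 2: recalibrate its outputs on held-out data, then compare the proper
# loss (here, log loss) before and after. All modelling choices are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

base = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

def log_odds(p, eps=1e-6):
    """Map probabilities to log-odds, clipping away 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# Post-hoc recalibration: a one-dimensional logistic fit from the base
# model's log-odds to the labels on the calibration split.
z_cal = log_odds(base.predict_proba(X_cal)[:, 1]).reshape(-1, 1)
recal = LogisticRegression().fit(z_cal, y_cal)

p_raw = base.predict_proba(X_test)[:, 1]
p_cal = recal.predict_proba(log_odds(p_raw).reshape(-1, 1))[:, 1]

print("test log loss, raw model:   ", log_loss(y_test, p_raw))
print("test log loss, recalibrated:", log_loss(y_test, p_cal))
```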

I hope this convinces you that considering the decomposition of proper scores not only provides useful and interpretable information about forecasts, but also simply helps maximise that score.

Ben Recht

I'll take a look, thanks. But can you clarify your last statement here: how can you get a higher score by choosing a strategy that doesn't maximize the score? Do you mean empirically? Or is this a theorem?

Eugène Berta

Sorry, I didn't make the distinction between training and test metrics clear enough.

We still train by maximising a proper score (we adopt the equivalent framework of minimising a proper loss like cross-entropy). What we claim is that early stopping on the "validation loss" is sub-optimal because neither the "validation calibration error" nor the "validation refinement error" is minimised.

A simple example is that if a classifier perfectly separates the training data, it needs to be 100% confident in its predictions to get zero "training calibration error", but this often translates to over-confidence at test time and thus large "test calibration error".

So over-fitting the training set is detrimental for calibration error, but it can still help decrease test refinement error (which often means better accuracy); this is sometimes referred to as "benign overfitting", I think.

In most cases, however, "test calibration error" can be dealt with quite easily with post-hoc methods. We show empirically that minimising "validation refinement error" during training (by early stopping / selecting hyper-parameters using something other than the score) and dealing with calibration error post hoc yields better test loss for a wide range of tasks and architectures.

As for the theory, we show that this phenomenon occurs even in very simple models by studying high-dimensional asymptotics of regularized logistic regression on a Gaussian data model.

Joachim

All of this reminds me of another approach from Hossain, Tanjim, and Ryo Okui. "The binarized scoring rule." Review of Economic Studies 80.3 (2013): 984-1001.

David Duvenaud

To me this just raises the question: why do we want a strictly proper rule? What was the problem with using a merely proper rule?

Ben Recht

From this utilitarian view, it's all about risk aversion. Strict rules necessarily imply a risk-averse utility function of some kind.

David Duvenaud

I understand, but why do we care about risk aversion in the first place? And if we do, why not just add it as a term in the utility function and still do expected utility maximization?

[Comment deleted]

[Comment deleted]

Roman W 🇵🇱🇺🇦

"the institutional nuance of having your impatient boss or partner shut down your model that bets on something very rare"

This can be taken into account explicitly by modelling the cost of funding. Even if you make money on average by losing 20 times and winning once, the capital has to come from somewhere.
