I'm a little confused by the free-throw example. Usually we care about a forecaster's average error, not the error of their average prediction. We judge weather forecasts based on their average error, not based on the error of the average forecast temperature. We wouldn't consider it good forecasting if Nate Silver's presidential-election predictions for 2004–2024 were Kerry, McCain, Romney, Clinton, Trump, and Harris, even though he would have gotten the right number of Republicans.
On a more technical level, I'm sure I'm being extremely dense, but how do you get the final step in the bound after equation (3)? Obviously
||F_t|| ≤ M
implies
Σ_t ||F_t||^2 ≤ TM^2,
but it seems like the claim is that
Σ_t ||F_t||^2 ≤ M^2?
Aha! What *do* errors of averages have to do with averages of errors? You have to read the whole thing to find out! :)
Seriously though, it took me a while to believe it even after Juanky showed me how to prove the results of Section 5. Very counterintuitive!
With regards to your technical question, we divide both sides by T to get the average. So your last expression should be
(1/T) Σ_t ||F_t||^2 ≤ M^2
I came here to ask the same question (the qualitative one, not the technical one). I admittedly haven't read the paper. I can understand an intuition that says if I'm minimizing the error of the average prediction, and distinct errors are independent then perhaps my expected error is the same. But certainly the variability would be higher right? I'd be better off stabilizing those free throw predictions around the observed value p if the game allows me to do that right? In fact I would have guessed that I'd be better off not only in terms of the variability but also in the expected error. (Maybe this depends on the choice of loss function)
The thing that's weird about this setup is there is no expectation or variance. The sequence of outcomes is chosen adversarially to mess up our score. Even still, you can find a reasonable-looking prediction by finding ways to correct past mistakes no matter how the outcomes are chosen.
I wrote today's post without equations, and I'm trying to figure out how to explain how to get low Brier Scores without predictions too... let me see if I can do that for tomorrow.
In the meantime, I hope the first ten pages of the paper are readable... that's where all the important stuff happens. The remaining 25 pages just apply the same main idea to many different cost functions/evaluation setups.
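If it helps to see the "correct past mistakes" idea concretely, here's a toy sketch I put together for this comment (a deliberately simplified scalar version, not the construction in the paper): the forecaster tracks only the running gap between outcomes and predictions and predicts whichever extreme shrinks it, and the error of its average prediction stays small no matter how the outcomes are chosen.

```python
import math

def next_prediction(gap):
    """Toy "correct past mistakes" rule; gap is the running sum of (y_s - p_s)."""
    if gap > 0:
        return 1.0  # outcomes have exceeded predictions so far, so predict high
    if gap < 0:
        return 0.0  # predictions have exceeded outcomes so far, so predict low
    return 0.5      # nothing to correct yet

T = 10_000
gap = 0.0
for t in range(T):
    p = next_prediction(gap)
    # An adversary that always reports the outcome farthest from the prediction.
    y = 0.0 if p >= 0.5 else 1.0
    gap += y - p

# Error of the average prediction. The sign rule makes gap * (y - p) <= 0 at
# every step, so gap^2 <= T and |gap| / T <= 1/sqrt(T) for any outcome sequence.
print(abs(gap) / T, 1 / math.sqrt(T))
```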
Hmm...but isn't the claim
(1/T) Σ_t ||F_t||^2 ≤ M^2/T,
not
(1/T) Σ_t ||F_t||^2 ≤ M^2?
Oh, I see what you mean (equations in substack comments! LOL!)
We show:
||Σ_t F_t||^2 ≤ Σ_t ||F_t||^2 ≤ TM^2
Now divide both sides by T^2 and take a square root.
Ah, now I see the problem! I think there's a typo. The denominator in the middle expression should be T^2, not T.
Ah, you are right! Thanks. We'll fix this in the next revision.
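For anyone following along, spelling it out: dividing ||Σ_t F_t||^2 ≤ Σ_t ||F_t||^2 ≤ TM^2 through by T^2 and taking square roots gives

||(1/T) Σ_t F_t|| ≤ sqrt((1/T^2) Σ_t ||F_t||^2) ≤ M/sqrt(T),

which is presumably what the corrected display after equation (3) will say.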
Search heuristics for predicting the next value in a sequence that serve as discursive substitutes for insight - might there also be a connection here with the purported capabilities of another set of high-profile "next token predictors"?
We leave this question to future work. :)
Still working this all through in my head but super interesting! Seems to me that the real benefit of the model, particularly from an org decision-making perspective, is the opportunity for status capture. Clever accounting is a powerful mechanism for framing actions as successes and accruing the status that comes with it!
I assumed this was a pretty well known fact about online calibration, i.e., if you allow Nature to pick outcomes adversarially to make your forecast look as bad as possible, then your only hope of achieving calibrated predictions over the sequence is through clever post hoc accounting. The more interesting question is when this approach to calibration is useful. I've written some posts on the statmodeling blog related to this:
https://statmodeling.stat.columbia.edu/2024/08/14/when-is-calibration-enough/
https://statmodeling.stat.columbia.edu/2024/11/01/calibration-is-sometimes-sufficient-for-trusting-predictions-what-does-this-tell-us-when-human-experts-use-model-predictions/
https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/
It's known for calibration, and we say so in the introduction, but you can apply the same techniques for expected utility maximization, online learning, prediction with experts, and online conditional conformal prediction. And you can achieve combinations of these objectives at once.
FWIW, I also don't think people appreciate how simple the online algorithms are. Everything reduces to running binary search on an appropriate summary function of the past observations. What I like about the defensive forecasting framework is that it tells you how to design online algorithms, which I have always found hard to motivate.
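To make the binary-search step concrete, here's a minimal sketch (my own illustration, with a Gaussian kernel standing in for the summary weights; the specific summary functions in the paper differ). The summary h(p) weights past errors by how close the old prediction is to the candidate p, and bisection finds a prediction at which it crosses zero:

```python
import numpy as np

def gaussian_kernel(p, q, bandwidth=0.1):
    """Similarity between a candidate prediction p and past predictions q."""
    return np.exp(-((p - q) ** 2) / (2 * bandwidth ** 2))

def defensive_step(past_p, past_y, tol=1e-6):
    """One online step: binary search on a summary function of past observations."""
    past_p = np.asarray(past_p, dtype=float)
    past_y = np.asarray(past_y, dtype=float)

    def h(p):
        # h(p) > 0: we have been under-predicting near p; h(p) < 0: over-predicting.
        if past_p.size == 0:
            return 0.0
        return float(np.sum(gaussian_kernel(p, past_p) * (past_y - past_p)))

    # If the summary has a constant sign, the corresponding endpoint is safe.
    if h(0.0) <= 0.0:
        return 0.0
    if h(1.0) >= 0.0:
        return 1.0
    # Otherwise h(0) > 0 > h(1): bisect for a zero crossing.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage: the forecast adapts to whatever sequence shows up, adversarial or not.
ps, ys = [], []
for y in [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]:
    ps.append(defensive_step(ps, ys))
    ys.append(y)
print([round(p, 2) for p in ps])
```

The endpoint checks and the zero crossing both guarantee that h(p_t)(y_t - p_t) ≤ 0 no matter which outcome arrives, which is the "defensive" part of the step.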
Like others, I'm not following here. Someone who predicted 1 on every throw would be right 70% of the time, whereas the strategy here gets it right 30% of the time.
OTOH, if the aim is to get as close as possible to the actual rate, this isn't what I would call forecasting, but estimation.
So the approach just helps you guess the correct probabilities. However, if you made bets on the forecasts, you lost 70% of the time. How is that defensive?
This model only works when conditions are the same. Suppose we look at other factors, e.g., which horses ran in a race, turf conditions, etc. How would this help you be defensive in horse race betting, even if your probabilities averaged over all other variables were good?
This post was written as a broad introduction without equations. But as I point out, Defensive Forecasting performs well in a surprising number of applications. In Section 6, for example, we describe how to add the kind of contextual features you're asking about.
https://arxiv.org/abs/2506.11848
Thank you. I will read the [long] paper.
I think the disclaimer at the end of Section 6 is relevant—basically the usual caveats about regret minimization apply. (Unfortunately these are not usually stated clearly in the literature, so thanks Juanky and Ben for including this paragraph!) One implication is that if your competition is using a more accurate model and/or better features than you are, then of course they can consistently beat your defensive forecasts.
> Are these prediction results good? The important point here is that in all problems with sublinear regret, the produced predictions are only as good as the baseline they are compared to. In this case, the baseline is a constant linear prediction function that has access to all of the data in advance. If a linear function provides good predictions, then Defensive Forecasting makes comparably good predictions. Once we make a commitment of how predictions will be evaluated and what they will be compared against, we can run Defensive Forecasting. But we reiterate there is no way to guarantee in advance whether the baseline itself provides a good fit to the data.