I'm a little confused by the free-throw example. Usually we care about a forecaster's average error, not the error of their average prediction. We judge weather forecasts based on their average error, not based on the error of the average forecast temperature. We wouldn't consider it good forecasting if Nate Silver's presidential-election predictions for 2004–2024 were Kerry, McCain, Romney, Clinton, Trump, and Harris, even though he would have gotten the right number of Republicans.
On a more technical level, I'm sure I'm being extremely dense, but how do you get the final step in the bound after equation (3)? Obviously
||F_t|| ≤ M
implies
Σ_t ||F_t||^2 ≤ TM^2,
but it seems like the claim is that
Σ_t ||F_t||^2 ≤ M^2?
Aha! What *do* errors of averages have to do with averages of errors? You have to read the whole thing to find out! :)
Seriously though, it took me a while to believe it even after Juanky showed me how to prove the results of Section 5. Very counterintuitive!
With regards to your technical question, we divide both sides by T to get the average. So your last expression should be
(1/T) Σ_t ||F_t||^2 ≤ M^2
I came here to ask the same question (the qualitative one, not the technical one). I admittedly haven't read the paper. I can understand an intuition that says if I'm minimizing the error of the average prediction, and distinct errors are independent then perhaps my expected error is the same. But certainly the variability would be higher right? I'd be better off stabilizing those free throw predictions around the observed value p if the game allows me to do that right? In fact I would have guessed that I'd be better off not only in terms of the variability but also in the expected error. (Maybe this depends on the choice of loss function)
The thing that's weird about this setup is there is no expectation or variance. The sequence of outcomes is chosen adversarially to mess up our score. Even still, you can find a reasonable-looking prediction by finding ways to correct past mistakes no matter how the outcomes are chosen.
I wrote today's post without equations, and I'm trying to figure out how to explain how to get low Brier Scores without predictions too... let me see if I can do that for tomorrow.
In the meantime, I hope the first ten pages of the paper are readable... that's where all the important stuff happens. The remaining 25 pages just apply the same main idea to many different cost functions/evaluation setups.
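If it helps to see the "correct past mistakes" idea concretely, here's a toy sketch I put together for this comment (a deliberately simplified scalar version, not the construction in the paper): the forecaster tracks only the running gap between outcomes and predictions and predicts whichever extreme shrinks it, and the error of its average prediction stays small no matter how the outcomes are chosen.

```python
import math

def next_prediction(gap):
    """Toy "correct past mistakes" rule; gap is the running sum of (y_s - p_s)."""
    if gap > 0:
        return 1.0  # outcomes have exceeded predictions so far, so predict high
    if gap < 0:
        return 0.0  # predictions have exceeded outcomes so far, so predict low
    return 0.5      # nothing to correct yet

T = 10_000
gap = 0.0
for t in range(T):
    p = next_prediction(gap)
    # An adversary that always reports the outcome farthest from the prediction.
    y = 0.0 if p >= 0.5 else 1.0
    gap += y - p

# Error of the average prediction. The sign rule makes gap * (y - p) <= 0 at
# every step, so gap^2 <= T and |gap| / T <= 1/sqrt(T) for any outcome sequence.
print(abs(gap) / T, 1 / math.sqrt(T))
```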
Hmm...but isn't the claim
(1/T) Σ_t ||F_t||^2 ≤ M^2/T,
not
(1/T) Σ_t ||F_t||^2 ≤ M^2?
Oh, I see what you mean (equations in substack comments! LOL!)
We show:
||Σ_t F_t||^2 ≤ Σ_t ||F_t||^2 ≤ TM^2
Now divide both sides by T^2 and take a square root.
Ah, now I see the problem! I think there's a typo. The denominator in the middle expression should be T^2, not T.
Ah, you are right! Thanks. We'll fix this in the next revision.
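For anyone following along, spelling it out: dividing ||Σ_t F_t||^2 ≤ Σ_t ||F_t||^2 ≤ TM^2 through by T^2 and taking square roots gives

||(1/T) Σ_t F_t|| ≤ sqrt((1/T^2) Σ_t ||F_t||^2) ≤ M/sqrt(T),

which is presumably what the corrected display after equation (3) will say.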
Search heuristics for predicting the next value in a sequence that serve as discursive substitutes for insight - might there also be a connection here with the purported capabilities of another set of high-profile "next token predictors"?
We leave this question to future work. :)
Still working this all through in my head but super interesting! Seems to me that the real benefit of the model, particularly from an org decision-making perspective, is the opportunity for status capture. Clever accounting is a powerful mechanism for framing actions as successes and accruing the status that comes with it!
I assumed this was a pretty well known fact about online calibration, i.e., if you allow Nature to pick outcomes adversarially to make your forecast look as bad as possible, then your only hope of achieving calibrated predictions over the sequence is through clever post hoc accounting. The more interesting question is when this approach to calibration is useful. I've written some posts on the statmodeling blog related to this:
https://statmodeling.stat.columbia.edu/2024/08/14/when-is-calibration-enough/
https://statmodeling.stat.columbia.edu/2024/11/01/calibration-is-sometimes-sufficient-for-trusting-predictions-what-does-this-tell-us-when-human-experts-use-model-predictions/
https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/
It's known for calibration, and we say so in the introduction, but you can apply the same techniques for expected utility maximization, online learning, prediction with experts, and online conditional conformal prediction. And you can achieve combinations of these objectives at once.
FWIW, I also don't think people appreciate how simple the online algorithms are. Everything reduces to running binary search on an appropriate summary function of the past observations. What I like about the defensive forecasting framework is that it tells you how to design online algorithms, which I have always found hard to motivate.
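To make the binary-search step concrete, here's a minimal sketch (my own illustration, with a Gaussian kernel standing in for the summary weights; the specific summary functions in the paper differ). The summary h(p) weights past errors by how close the old prediction is to the candidate p, and bisection finds a prediction at which it crosses zero:

```python
import numpy as np

def gaussian_kernel(p, q, bandwidth=0.1):
    """Similarity between a candidate prediction p and past predictions q."""
    return np.exp(-((p - q) ** 2) / (2 * bandwidth ** 2))

def defensive_step(past_p, past_y, tol=1e-6):
    """One online step: binary search on a summary function of past observations."""
    past_p = np.asarray(past_p, dtype=float)
    past_y = np.asarray(past_y, dtype=float)

    def h(p):
        # h(p) > 0: we have been under-predicting near p; h(p) < 0: over-predicting.
        if past_p.size == 0:
            return 0.0
        return float(np.sum(gaussian_kernel(p, past_p) * (past_y - past_p)))

    # If the summary has a constant sign, the corresponding endpoint is safe.
    if h(0.0) <= 0.0:
        return 0.0
    if h(1.0) >= 0.0:
        return 1.0
    # Otherwise h(0) > 0 > h(1): bisect for a zero crossing.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage: the forecast adapts to whatever sequence shows up, adversarial or not.
ps, ys = [], []
for y in [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]:
    ps.append(defensive_step(ps, ys))
    ys.append(y)
print([round(p, 2) for p in ps])
```

The endpoint checks and the zero crossing both guarantee that h(p_t)(y_t - p_t) ≤ 0 no matter which outcome arrives, which is the "defensive" part of the step.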
Like others, I'm not following here. Someone who predicted 1 on every throw would be right 70% of the time, whereas the strategy here gets it right 30% of the time.
OTOH, if the aim is to get as close as possible to the actual rate, this isn't what I would call forecasting, but estimation.
So the approach just helps you guess the correct probabilities. However, if you made bets on the forecasts, you lost 70% of the time. How is that defensive?
This model only works when conditions are the same. Suppose we look at other factors, e.g., which horses ran in a race, turf conditions, etc. How would this help you be defensive in horse race betting, even if your probabilities averaged over all other variables were good?
This post was written as a broad introduction without equations. But as I point out, Defensive Forecasting performs well in a surprising number of applications. In Section 6, for example, we describe how to add the kind of contextual features you're asking about.
https://arxiv.org/abs/2506.11848
Thank you. I will read the [long] paper.
I think the disclaimer at the end of Section 6 is relevant—basically the usual caveats about regret minimization apply. (Unfortunately these are not usually stated clearly in the literature, so thanks Juanky and Ben for including this paragraph!) One implication is that if your competition is using a more accurate model and/or better features than you are, then of course they can consistently beat your defensive forecasts.
> Are these prediction results good? The important point here is that in all problems with sublinear regret, the produced predictions are only as good as the baseline they are compared to. In this case, the baseline is a constant linear prediction function that has access to all of the data in advance. If a linear function provides good predictions, then Defensive Forecasting makes comparably good predictions. Once we make a commitment of how predictions will be evaluated and what they will be compared against, we can run Defensive Forecasting. But we reiterate there is no way to guarantee in advance whether the baseline itself provides a good fit to the data.