Hi Ben,

Such intervals seem to me quite useful for formulating (robust) optimisation problems. One specific type is scheduling problems, where you have a set of events and a partial order relation over them. The intervals would provide lower and upper bounds on the time elapsed between events (e.g. "Start cleaning house", "Finish cleaning house").

I would love to have a close look at those prediction bands in the context of lookahead and bandit algorithms for approximate dynamic programming (MCTS and beyond).

I want "honest" unbiased prediction bands around the following.

1) Predictions for personalized treatment decisions, i.e., the expected benefit of treatment A over treatment B given person-specific covariates. This comes up a lot in cancer and psychiatry. Most "ML predictions" are population quantities that might be totally inadequate for use in clinical decision making.

2) Changes in polygenic risk scores. If someone claims to be able to rank embryos for screening, I want to see the prediction bands for the expected risk reduction from picking embryo A over embryo B, or some relaxed version of that (top 10% quantile over the next 10% quantile).

3) For drug development, you need good prediction bands around the expected future performance of a large number of candidates from high-throughput screens. It is quite challenging to evaluate the accuracy of intervals for each individual candidate here, as most candidates have rarely been measured in expensive high-fidelity experiments.

I really wish reviewers for Nature journals actually understood why this is practically necessary for all the applications they are prematurely excited about.

Re: 1

Isn't the challenge that the "ML predictions" are from models that likely aren't making predictions that are useful for *decision-making*? For example, the models I'm thinking of fit outcome prediction models to observational data. I could put intervals around the predictions, but this wouldn't change the fact that I likely don't have the requisite data to make counterfactual predictions. Would you agree with this characterization?

Absolutely agree that in the counterfactual prediction context, the first problem is that one needs a solid basis for why identifying the counterfactual effect is feasible. But subsequently, you do want adaptive prediction bands/conditional coverage of some kind around these counterfactual predictions. I'm thinking of scenarios where there is some basis for identification (e.g., RCTs, or some other well-justified instrumental variable), but where one also needs some degree of accuracy in individual predictions to justify personalized decision making.

Here is another one. Any putative candidate for a monitoring biomarker of therapeutic engagement requires tolerance intervals for the values the biomarker typically takes over a time span common for drug intervention trials. So if someone wants to use EEG/fMRI biomarkers to evaluate whether a pharmacological intervention works, they first need to understand the day-to-day variability that might occur under a placebo or treatment-as-usual setting over a 12-week period.

https://www.ncbi.nlm.nih.gov/books/NBK402282/
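As a back-of-the-envelope sketch of what that tolerance-interval requirement costs in sample size (my numbers, not from the comment above, using the classical distribution-free order-statistics result attributed to Wilks):

```python
def wilks_confidence(n: int, p: float) -> float:
    """Confidence that the interval [min, max] of n i.i.d. measurements
    covers at least a fraction p of the underlying distribution
    (distribution-free; the coverage of [x_(1), x_(n)] is Beta(n - 1, 2))."""
    return 1 - n * p ** (n - 1) + (n - 1) * p ** n

def smallest_n(p: float, confidence: float) -> int:
    """Smallest number of repeated measurements giving that guarantee."""
    n = 2
    while wilks_confidence(n, p) < confidence:
        n += 1
    return n

# E.g., to claim the observed placebo-arm range covers 90% of a subject's
# day-to-day biomarker values with 95% confidence, you need:
print(smallest_n(0.90, 0.95))  # -> 46 measurement sessions
```

So even the weakest distribution-free version of the requirement implies dozens of repeated measurement sessions per condition, far more than a typical pilot study collects.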

The comment about things falling apart when the "DGP" changes: what does that have to do with a prediction interval? Any old ML prediction, intervaled or not, suffers from this breakdown, no? Also, this breakdown itself could be what the prediction (interval) is constructed for (say, monitoring for change detection, trying to invalidate itself). On a different note, quantile regression is really fascinating, isn't it?

Clarification question about your first question: Do you mean the current post here, or this one?

https://www.argmin.net/p/when-is-the-future-the-same-as-the

After too much time in learning theory, I'm not convinced that quantile regression is particularly interesting, but that's a me problem.

I read the third-from-last paragraph in this post as if you were saying this was an issue specific to interval predictions. Quantile regression is like sorting without actually sorting, which may or may not be fascinating.

Ah, gotcha. I'm after something else in that paragraph: the difference between marginal and conditional coverage.

And yeah, you're right that quantile regression is basically just sorting. Which is why thinking it can solve all of uncertainty quantification is bizarre.

> prediction bands to estimate the efficacy of a drug. But I’m confused here because that’s what randomized trials are for

I'm guessing what people want here is prediction bands for conditional average treatment effects for the drug or some such thing.

> In order for the math to be valid, both the future outcomes and the future features must occur in the same way they occurred in the past

Yes, I think for them to be useful one needs to design studies/experiments to make this actually true. It is very hard to convince people to collect enough data to make uncertainty estimates accurate, though.

How about weather forecasting as an application where prediction intervals for nonparametric regression with marginal guarantees can be useful? The residuals are plausibly exchangeable, and it's not a setting where we are using the forecasts to influence the outcome.

I'd want some evidence that the residuals are exchangeable before diving in, but I like this example. What would such prediction bands be helpful for?

Agreed re: the exchangeability of the residuals, and we definitely don't believe the underlying time series is exchangeable. We could try to account for that too, though (e.g., [1, 2, 3]; I have no idea how well these work in practice).

To answer your question though: let's say, not so hypothetically, that I want to play tennis this weekend. I'm deciding whether to pay X >> 0 to reserve an indoor court, which I must do a few days in advance, or hope that it's warm enough to play for free at the park. Google gives me a point prediction of 49 degrees for this Sunday. What should I do? A "distribution-free" prediction interval (even one with only a marginal guarantee) seems valuable for informing my decision, right?

[1] https://arxiv.org/abs/2210.02271

[2] https://arxiv.org/abs/2010.09107

[3] https://arxiv.org/abs/1802.06300
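A minimal sketch of the split-conformal construction being discussed, with synthetic forecasts and a made-up noise level standing in for a real forecast archive:

```python
import math
import random

def split_conformal_interval(forecasts, outcomes, new_forecast, alpha=0.1):
    """Interval around a point forecast built from past absolute forecast
    errors, assuming those errors are exchangeable. Marginal coverage only."""
    scores = sorted(abs(y - f) for f, y in zip(forecasts, outcomes))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index
    q = scores[min(k, n) - 1]
    return new_forecast - q, new_forecast + q

# Synthetic stand-in for a forecast archive: truth = forecast + N(0, 3) noise.
random.seed(0)
forecasts = [random.uniform(30, 70) for _ in range(500)]
outcomes = [f + random.gauss(0, 3) for f in forecasts]

lo, hi = split_conformal_interval(forecasts, outcomes, new_forecast=49)
print(round(lo, 1), round(hi, 1))  # 49 +/- roughly the 90% error quantile
```

Note the guarantee is marginal over the whole archive: the interval around the 49-degree forecast is the same width as around any other forecast, whether or not this Sunday is an unusually hard day to predict.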

Well, if the prediction band is 47-51, does it change your decision? What about if it's 45-55? What about if it's 30-70? When does the uncertainty band become useful?

Hmm fair question. I guess I have some threshold above which it's reasonable to play outdoors (let's say 40 degrees) and it would be good to know with e.g., 90% confidence that it will be >= 40 degrees. That is an either-or event though, so maybe (as you said in the post) what I *actually* want is a confidence interval around the probability of that event.

Weather prediction is a great example, and this is a good point. However, supposing that different people have different thresholds for "too cold", reporting a single prediction interval seems preferable to sending personalized "it is likely to be too cold today" alerts to individuals whose thresholds overlap with the prediction interval.

As a farmer, you might be interested in the quantity of rain over the next few days. For instance, if there is a good chance it will exceed, say, 5 mm, you might want to harvest earlier than usual, trading maturity against dryness of the grain, to limit the risk that a significant fraction of your harvest rots or becomes too expensive to dry in the warehouse.

In my follow-up post, I described a frost warning, which is a similar agricultural example. https://www.argmin.net/p/predictions-and-actions-redux

Google Maps gives you estimated time intervals rather than point estimates if you plan a car trip in the future (e.g. same time but tomorrow). This probably reflects uncertainty in the traffic.

As a user I find this pretty useful if you have to make a decision between several modalities (e.g. public transportation vs bike vs taxi).

How do you use them? Is the idea that if the interval is too wide, you pick a less risky mode of transportation?

Yes, deciding to switch modality based on the width of the interval, or, if there is no alternative modality with a narrower prediction interval, deciding to wake up and leave much earlier to decrease the risk of missing your very important appointment.

Any active decision-making process where you get to choose where to gather more information? E.g., Bayesian optimisation/kriging to determine where to drill for gold based on a predictive model of the reef.

The main problem with Bayesian optimization is that the new sampling will cause the future points to not look like the past.
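A toy simulation of this feedback problem (all numbers mine): build a nominal 90% interval for Gaussian measurement noise, then compare its coverage on a randomly chosen candidate versus the candidate selected because its measurement was highest. The selected candidate's residual is biased upward (the winner's curse), which is exactly the "future points don't look like the past" failure.

```python
import random

SIGMA = 1.0                  # measurement noise scale
HALF_WIDTH = 1.645 * SIGMA   # nominal 90% interval for Gaussian noise

def one_screen(n_candidates=50):
    """One screen: candidates with true values mu, noisy measurements m.
    Compare interval coverage for a random candidate vs. the argmax one."""
    mus = [random.gauss(0, 1) for _ in range(n_candidates)]
    meas = [mu + random.gauss(0, SIGMA) for mu in mus]
    i_rand = random.randrange(n_candidates)
    i_best = max(range(n_candidates), key=lambda i: meas[i])
    covers = lambda i: abs(meas[i] - mus[i]) <= HALF_WIDTH
    return covers(i_rand), covers(i_best)

random.seed(1)
results = [one_screen() for _ in range(4000)]
cov_rand = sum(r for r, _ in results) / len(results)
cov_best = sum(b for _, b in results) / len(results)
print(f"random candidate: {cov_rand:.2f}, argmax candidate: {cov_best:.2f}")
# The marginal guarantee (~0.90) holds for a random candidate but fails
# badly for the one you chose *because* its measurement looked best.
```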

Trying to understand: is the problem that the data collection process imbalances the training set, so that this method of uncertainty quantification starts to fail?

You might want to check out the recent work on conformal decision theory (https://conformal-decision.github.io/) if you haven't seen it. I'm not sure how well it meshes with the case you're making here, but it seemed roughly on theme: to make contextually relevant, risk-sensitive decisions, you're better off calibrating the decisions directly and skipping the coverage sets.

The probability of an event and the confidence interval of a prediction are actually two sides of the same coin. Take a simple linear model with normal errors: the statement that a prediction falls at or below some threshold with a given confidence corresponds to the probability returned by the equivalent probit regression.
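To make the "two sides of the same coin" point concrete, here is a small sketch with toy coefficients of my choosing: for a linear model with normal errors, the probability that the outcome falls below a threshold is exactly a probit model in the same covariate, with rescaled coefficients.

```python
from math import erf, sqrt

def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Linear model: y = a + b*x + eps, with eps ~ N(0, s^2). Toy coefficients.
a, b, s = 30.0, 0.5, 4.0
threshold = 40.0

def p_below_from_linear(x: float) -> float:
    """P(y <= threshold | x), read straight off the linear model."""
    return norm_cdf((threshold - (a + b * x)) / s)

# The same quantity as a probit model Phi(c0 + c1*x) in the same covariate:
c0, c1 = (threshold - a) / s, -b / s

def p_below_from_probit(x: float) -> float:
    return norm_cdf(c0 + c1 * x)

for x in (0.0, 10.0, 25.0):
    print(x, round(p_below_from_linear(x), 4), round(p_below_from_probit(x), 4))
```

The two functions agree identically, which is the sense in which a normal-errors prediction band and an event probability carry the same information.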