As we plow through this week on prediction intervals, each post gets more and more niche. Today, let’s talk about how to use prediction intervals to quantify the uncertainty of machine learning models. Say I’ve trained some neural net to predict some real-valued outcome y. On a newly sampled x, perhaps returning a point estimate for y isn’t informative enough. My prediction function might not be that good, and I might want a conservative estimate of the possible outcomes.

Let me give a contrived example, as I’m known to do. I’m trying to predict how long it will take me to clean up my house. I’ve carefully diaried all of the past times I’ve cleaned my house, noting features about the relative messiness of the rooms. I can then predict the time using a neural network on past cleanings. But this neural network is only going to give me a point estimate. What if I really need to make it to a friend’s party and need a better estimate of the worst-case time? I would like a statement like, “I’m just a language model, but I’m quite sure it will take between 30 and 90 minutes.” Can we train the neural net to output such ranges?

Here’s a procedure that uses what we’ve learned this week.

1) Split my data into two sets: a training set and a calibration set.

2) Fit a prediction function f with the training set.

3) Record the errors of the fitted function on the calibration set. That is, if x is the vector of input features and y is the outcome, build the set of errors e = |f(x)-y| over the calibration set.

4) Now build a prediction interval for the errors. If you are using *m* calibration examples, find an E such that, on the calibration set, e is less than or equal to E with rate r = 0.95 + sqrt(6.1/*m*).

With this choice of E, the next y will be between f(x)-E and f(x)+E with probability at least 95%.
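The four steps can be sketched in a few lines of numpy. Everything here is a stand-in: the data is synthetic, and a least-squares line plays the role of the inscrutable neural net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + noise. Any iid source works.
n = 20_000
x = rng.uniform(0, 1, size=n)
y = 2 * x + rng.normal(0, 0.1, size=n)

# 1) Split into a training set and a calibration set.
x_train, y_train = x[:10_000], y[:10_000]
x_cal, y_cal = x[10_000:], y[10_000:]

# 2) Fit a prediction function on the training set
#    (a least-squares line standing in for the neural net).
slope, intercept = np.polyfit(x_train, y_train, 1)
f = lambda t: slope * t + intercept

# 3) Record the absolute errors on the calibration set.
errors = np.abs(f(x_cal) - y_cal)

# 4) Pick E so that e <= E at the inflated rate r.
m = len(errors)
r = min(1.0, 0.95 + np.sqrt(6.1 / m))
E = np.quantile(errors, r)

# The prediction interval for a new x is [f(x) - E, f(x) + E].
print(E)
```

Note that the inflation term sqrt(6.1/m) only makes sense once m is a few thousand; for small calibration sets, r exceeds 1 and you just take the largest error.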

This seems neat. For any function, perhaps one you trained in some inscrutable way in PyTorch, I can always use some validation data to estimate the uncertainty of the next sample. (Again, provided that the next sample is independent and identically distributed as my validation data.)

What’s nice about this procedure is that it feels a bit better than recording just the average or median error on a test set as a way of comparing the usefulness of predictors. This probabilistic guarantee makes it feel like I’m quantifying uncertainty more carefully because I am computing an interval rather than just an average error. Maybe this is right? In most machine learning papers, a regression error gets recorded as a single mean-squared error in some table. Would a better convention be plotting the empirical distribution function of the errors on the test set? You tell me if that would be more useful.
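For what it’s worth, the empirical distribution function of test errors is a one-liner to compute. Here’s a sketch on hypothetical errors; in practice you’d plot the step function rather than query it at a few points.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical absolute errors from some fitted model on a test set.
errors = np.abs(rng.normal(0, 0.1, size=500))

# The empirical distribution function: for any threshold t,
# the fraction of test errors at or below t.
sorted_errors = np.sort(errors)

def ecdf(t):
    return np.searchsorted(sorted_errors, t, side="right") / len(sorted_errors)

# A single mean hides the shape of the error distribution;
# the ECDF shows the whole thing.
print(errors.mean(), ecdf(0.1), ecdf(0.2))
```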

You might find it annoying that the size of the prediction band is uniform for all inputs in the above algorithm. What if you think some instances should be easier to predict than others? Well, now you have to be more clever, but you can always reduce things back to estimating cumulative distributions of something. For instance, you could fit two functions to your data, directly estimating an upper and lower bound, U(x) and L(x). Then, you can build a prediction interval of the score

e = max(L(x)-y, y-U(x))

For the appropriate quantile level E, you’ll get the prediction band [L(x)-E, U(x)+E]. This is just one idea, and there are other things you could do. For example, Tengyuan Liang has a neat paper showing how to simultaneously estimate the prediction function and its squared error with excellent out-of-sample guarantees. If you want prediction bands, there are plenty of ways to get them.
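Here’s a sketch of that score-based construction on synthetic heteroskedastic data. The lower and upper bound functions here are crude stand-ins, a least-squares fit shifted by the 5th and 95th percentiles of its training residuals; a real pipeline would fit quantile regressors for L(x) and U(x).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic heteroskedastic data: the noise grows with x.
n = 3000
x = rng.uniform(0, 1, size=n)
y = 2 * x + rng.normal(0, 0.05 + 0.2 * x, size=n)

x_train, y_train = x[:1500], y[:1500]
x_cal, y_cal = x[1500:], y[1500:]

# Crude stand-ins for learned bound functions L(x) and U(x):
# a least-squares line shifted by residual percentiles.
slope, intercept = np.polyfit(x_train, y_train, 1)
resid = y_train - (slope * x_train + intercept)
lo, hi = np.quantile(resid, [0.05, 0.95])
L = lambda t: slope * t + intercept + lo
U = lambda t: slope * t + intercept + hi

# Build a prediction interval of the score e = max(L(x)-y, y-U(x))
# on the calibration set, exactly as in the uniform-band case.
scores = np.maximum(L(x_cal) - y_cal, y_cal - U(x_cal))
E = np.quantile(scores, 0.95)

# The prediction band for a new x is [L(x) - E, U(x) + E].
print(E)
```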

But I still come back to asking why we want prediction bands. My house cleaning example is contrived because I can never come up with good applications for prediction bands around neural networks. I’ve seen some examples in the literature, and they all seem just as contrived as my example.

I’ve heard that you would use these prediction bands to estimate the efficacy of a drug. But I’m confused here because that’s what randomized trials are for. I don’t understand what prediction intervals would add. Some people argue prediction intervals can be used to estimate risks of credit defaults. But credit defaults are either-or random variables, so you just need boring confidence intervals around probabilities.

And all of the examples people list come back to *doing something*. I make predictions so that I can act later. But the sorts of guarantees we’re getting from prediction intervals don’t really help us there. People often interpret this prediction band as saying the probability y will be inside the prediction band for f(x) for *any* x is 95%. But that is not what the math tells us here. We need to be more precise. The theory says there’s a 95% chance that the *next* f(x) and y are close. In order for the math to be valid, *both* the future outcomes *and* the future features must occur in the same way they occurred in the past. If you pick x in some other way, this prediction interval math can’t guarantee anything about what y you’ll see. That means that if you are trying to act to change an outcome, this whole framework isn’t helpful.

Since I’m puzzled, I hope some people will set me straight in the comments. Tell me examples where we can use prediction intervals for nonparametric regression with marginal guarantees.

I’ll admit it: though I spend a lot of time thinking about prediction, I always come away confused by the “math” that claims to solve it. Has this entire week on substack been a subtweet of conformal prediction? Yes. Yes it has. In a rather technical post tomorrow, I’ll be more direct and explain why the elegant conformal guarantees are just a different (and misleading) way of describing the estimation of empirical distributions of scalar residuals.

Hi Ben,

Such intervals seem to me quite useful for formulating (robust) optimisation problems. A specific type is, for instance, scheduling problems where you have a set of events and a partial order relation over them. The intervals would provide lower and upper bounds on the time elapsed between events (e.g. "Start cleaning house", "Finish cleaning house").

I would love to have a close look at those prediction bands in the context of lookahead and bandit algorithms for approximate dynamic programming (MCTS, and beyond).

I want "honest" unbiased prediction bands around the following.

1) Predictions of personalized treatment decisions, i.e., the expected benefit of treatment a over treatment b, given person-specific covariates. This comes up in cancer and psychiatry a lot. Most "ML predictions" are population quantities that might be totally inadequate for use in clinical decision making.

2) Changes in polygenic risk scores. In this case, if someone claims to be able to rank embryos for screening, I want to see the prediction bands for the expected risk reduction in picking embryo A vs. embryo B, or some relaxed version of it (top 10% quantile over next 10% quantile).

3) For drug development, you need good prediction bands around the expected future performance of a large number of candidates from high-throughput screens. It is quite challenging to evaluate the accuracy of intervals here for each individual candidate, as most candidates have rarely been measured in expensive high-fidelity experiments.

Really wish reviewers for Nature journals actually understood why this is practically necessary for all the applications they are prematurely excited about.