Maybe You're Wrong
Quantiles, Prediction Intervals, and what theory can tell you about the future.
This is a live blog of Lecture 16 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
Let’s say I have a fleet of electric, autonomous vehicles that I have set loose as a taxi service upon some urban area. I’d really prefer that these vehicles don’t run out of battery with a passenger in the back seat. Before every ride, I estimate the amount of charge my robot will need to complete its route. I base this estimate on the past times my cars have done similar routes. But I’m not happy with a point estimate. My dataset shows that the amount of battery discharged on the same route can vary because of all sorts of unpredictabilities, like traffic, construction, or emergencies. So rather than using a point estimate, I look at the distribution of past discharges and base my ride assignment on a quantile: “95% of the time, cars used this much battery on similar routes.” If I’m running out of battery for 1 out of 20 passengers, that’s no big deal on my bottom line. I’m losing thousands of dollars on each ride anyway.
This hypothetical motivates the prediction interval. In machine learning problems, rather than returning a point estimate, you can return a quantile. That quantile can capture some of the variability in your training data.
It’s not hard to imagine a quantile version of the hold-out method. Build some prediction function with one set of data. Then use a second, held-out set of data to compute a quantile of that function’s errors.
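If you want to see what I mean, here’s a minimal sketch in Python. The linear model, the absolute-error score, and the symmetric interval are all just illustrative choices on my part, not anything canonical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def holdout_quantile_interval(X_train, y_train, X_cal, y_cal, alpha=0.05):
    # Step 1: fit any prediction function on the first chunk of data.
    model = LinearRegression().fit(X_train, y_train)

    # Step 2: on the held-out chunk, record how wrong the model is and
    # take an empirical quantile of those absolute errors.
    residuals = np.abs(y_cal - model.predict(X_cal))
    q = np.quantile(residuals, 1 - alpha)

    # The "prediction interval" is the point prediction padded by q.
    def predict_interval(X_new):
        preds = model.predict(X_new)
        return preds - q, preds + q

    return predict_interval
```

That’s it. It’s the hold-out method with a quantile of the errors where the mean error used to be.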
For the past decade, everyone has been calling this conformal prediction.
“Instead of returning the mean, return the quantile” is fine, I suppose. And you can extend this idea to come up with cute constructions for classification problems where you allow a classifier to return “I don’t know.” With this new option to defer, no matter how good the classifier was before, it’s now “right” 95% of the time.
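If you want the classification version spelled out, here’s a toy sketch. It assumes you already have predicted class probabilities on held-out data, and the score (one minus the probability assigned to the true label) is just one common choice:

```python
import numpy as np

def calibrate_threshold(probs_cal, labels_cal, alpha=0.05):
    # Score each held-out example by how little probability the model
    # put on its true label, then take an empirical quantile.
    scores = 1.0 - probs_cal[np.arange(len(labels_cal)), labels_cal]
    return np.quantile(scores, 1 - alpha)

def prediction_set(probs_new, threshold):
    # Keep every label whose score clears the calibrated bar. Returning
    # several labels (or all of them) is the classifier's way of deferring.
    return np.where(1.0 - probs_new <= threshold)[0]
```

On data that looks like the calibration set, about 95% of true labels land inside these sets. Nothing stops a set from containing most of the labels, though, and a set like that is the “I don’t know.”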
Last year, I wrote a series of posts about prediction intervals. I should put this together into one document at some point. The takeaway was that quantiles are fine. They are not magic. Once you comb through the theorems, conformal prediction guarantees much less than you’d hope it would. It just amounts to using a test set to compute a quantile rather than a point prediction. Quantiles potentially lead to more conservative decision rules downstream. There’s nothing wrong with being risk-averse with untrustworthy machine learning, I suppose.
But we also shouldn’t get too excited about the guarantees and try to snow people with mathematical statistics. Conformal prediction proponents tell us that we are “rigorously” adding risk aversion. This fetishization of rigor bothers me to no end, and it’s worth repeating the main caveats you have to keep in mind if you want to yell conformal in your experiments section.
First, conformal prediction advocates claim their theory is “distribution-free.” But the theorems still only hold if your data is sampled from a distribution. That’s not distribution-free! Asserting an exchangeable process is assuming there’s a distribution—a distribution with very strong assumptions put upon it. We all know that seldom is there a distribution from which we’re sampling. There are no distributions in machine learning. Just because you’re not saying “Gaussian” doesn’t mean that your intervals are somehow more valid in reality.1
And even when you do have a probability distribution from which you sample, you have to deal with the nebulous nature of the ex ante guarantee. Most conformal prediction guarantees are deeply misleading. They assert that the probability a new observation falls inside your predicted set is at least 95%. But that probability is over both the new observation and the training set. It’s an ex ante guarantee, not an ex post guarantee.
As I’ve written before, you have to be suspicious when theorems promise you 95% accuracy no matter what the sample size. A fairer assessment of conformal prediction methods is PAC-style: you’d like a theorem that states “with probability 1-delta over the training set, the next sample will be in my prediction interval with probability 1-alpha.” If you analyze conformal prediction this way, you’ll see that you can’t get around the law of large numbers and need astronomically large samples to get precise prediction intervals. Moreover, you’ll find you could have extracted similar guarantees from the Dvoretzky–Kiefer–Wolfowitz inequality, first proven in 1956.
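For reference, here’s the inequality (with Massart’s tight constant) and the sample-size requirement you get by rearranging it:

```latex
% Empirical CDF \hat{F}_n from n i.i.d. samples versus the true CDF F:
\Pr\Big(\sup_{t}\,\big|\hat{F}_n(t) - F(t)\big| > \varepsilon\Big)
  \;\le\; 2\exp\!\big(-2 n \varepsilon^2\big),
\qquad\text{so } n \ge \frac{\log(2/\delta)}{2\varepsilon^2}
\text{ suffices for accuracy } \varepsilon \text{ with probability } 1-\delta .
```

Pinning every quantile down to within one percentage point with 95% probability already takes around 18,000 held-out samples, and tightening that to a tenth of a point multiplies the requirement by a hundred.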
My read of conformal prediction is as a post-hoc motivation of quantiles. If you want a 95% quantile on the future, and you believe the future is like the past, compute a 97% quantile on your data and release that. If you want to be really sure that 95% of the future is in there, release the 100% quantile. There’s still a chance that you’ll see something even more extreme, but if the distributional assumption is true, you can bound the odds.
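The “97%” is shorthand for the standard finite-sample correction. If the past observations and the next one are exchangeable and ties don’t occur, the next observation is equally likely to land in any of the n+1 slots among the sorted past values, which tells you exactly which order statistic to release:

```latex
% Z_1, ..., Z_n are past observations, Z_{(k)} is their k-th order statistic,
% and Z_{n+1} is the next observation.
\Pr\big(Z_{n+1} \le Z_{(k)}\big) \;\ge\; \frac{k}{n+1},
\qquad\text{so } k = \big\lceil (n+1)(1-\alpha) \big\rceil
\text{ gives } \Pr\big(Z_{n+1} \le Z_{(k)}\big) \ge 1-\alpha .
```

With 100 past observations and alpha equal to 0.05, that’s k = 96: release the 96th of your 100 sorted data points, a quantile a notch above the nominal 95%.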
Of course, you never know if the distributional assumption is valid. But conformal prediction and related methods provide a setting in which the heuristic extraction of policies from whisker plots is probabilistically valid.
I guess the main point I’d like to make in this class about uncertainty quantification is that not all uncertainty quantification is explicitly probabilistic. My scale measures coffee to 0.1 grams. What is the confidence interval when I weigh out 19.0 grams for a pour over? I don’t really care because I’m never operating at the scale’s calibrated precision. We’ve all become accustomed to believing this measurement is good up to that small amount, and we move on with our lives accordingly. Would I trust my scale more if it shipped with a theorem?
I’m picking on frequentist uncertainty quantification, but I have seen nothing better from the Bayesians. I’m still waiting for my MCMC to mix over here…

