Thanks to everyone who posted applications for prediction bands yesterday. I plan on responding to all the comments in a final post on this topic next week. But before I get to them, I want to address a higher-level question: Can conformal prediction achieve the uncertainty quantification we need for decision-making?

What is conformal prediction anyway? It’s the algorithm I described yesterday, wrapped in mathematical deceit. From page 5 in Angelopoulos and Bates’ Foundations and Trends monograph, the conformal prediction method goes as follows:

1. Start with any prediction function and a calibration set of size n.

2. Identify a heuristic notion of uncertainty using the pre-trained model.

3. Define a real-valued score function s(x, y).

4. Find the smallest q larger than a (1-alpha)*F fraction of the scores on the calibration set.[1]

5. For any x, return the prediction set C(x) = { all y such that s(x, y) <= q }.
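Concretely, here's a minimal sketch of the recipe for regression with the absolute-residual score s(x, y) = |y - f(x)|. The toy data, the deliberately mediocre predictor, and alpha = 0.1 are all illustrative assumptions of mine, not anything from the monograph.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # any pre-trained predictor; here a deliberately mediocre linear fit
    return 1.5 * x

# a calibration set of size n, assumed iid (hypothetical toy data)
n = 1000
x_cal = rng.uniform(-1, 1, size=n)
y_cal = 2.0 * x_cal + rng.normal(0, 0.5, size=n)

alpha = 0.1
scores = np.abs(y_cal - f(x_cal))            # s(x, y) = |y - f(x)|
level = np.ceil((n + 1) * (1 - alpha)) / n   # the (1 - alpha) * F quantile level
q = np.quantile(scores, level, method="higher")

def prediction_set(x):
    # C(x) = { y : s(x, y) <= q }, an interval for this particular score
    return f(x) - q, f(x) + q
```

Note that nothing in the calibration step cares whether f is any good; a worse f just inflates q and widens every interval.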

The conformal acolytes delight in telling you about how this C(X) has guaranteed coverage: if your calibration set AND the future point (x,y) are generated by some exchangeable process, then the probability that you compute a prediction set containing y is at least 1-alpha. This probabilistic coverage is guaranteed no matter how many calibration samples you use. The coverage is guaranteed no matter what the score function is. It is hence “distribution free.” You get uncertainty quantification for nothing.

But there’s no free lunch. Let’s unpack the promises.

First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis. I can also derive rigorous theorems about mechanics, assuming the Earth is flat and sits at the center of the universe. Rigor yields diminishing returns when the foundational assumptions are untestable and implausible.

Second, we should be worried that you get the same probabilistic guarantee regardless of the scoring function. Conformal prediction almost invites you to use garbage prediction functions. You’ll get the same coverage guarantees for a carefully fine-tuned transformer model as you’ll get for a literally random function. This leads to sloppy reasoning, where we spend too much time proving the wrong theorems. Yesterday, I discussed a paper that built prediction functions that were supposed to be upper and lower approximations of a regression function. The paper guarantees conformal coverage but doesn’t guarantee that the upper and lower bound functions don’t cross.

Let me grant you iid data and assume you really want to calibrate your predictor using conformal techniques. The third reason the conformal guarantees are misleading is that they conflate the probability the algorithm is correct with the probability the prediction is correct. Conformal prediction does *not* tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!
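A quick simulation makes the distinction concrete. When the calibration scores are iid uniform, the coverage conditional on a given calibration set is exactly the calibrated threshold q, so we can watch that conditional coverage swing from one calibration draw to the next while its average stays pinned near 1-alpha. The setup below (n = 100, 2000 draws) is an illustrative choice of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, trials = 100, 0.1, 2000
level = np.ceil((n + 1) * (1 - alpha)) / n

coverages = []
for _ in range(trials):
    scores = rng.uniform(size=n)                 # a fresh calibration set
    q = np.quantile(scores, level, method="higher")
    coverages.append(q)  # for uniform scores, conditional coverage is exactly q
coverages = np.array(coverages)

print(coverages.mean())                  # the averaged guarantee: near 1 - alpha
print(coverages.min(), coverages.max())  # any single calibration can be far off
```

The conformal theorem is a statement about the first printout; the thing you probably want is a statement about the set you actually calibrated, which is one sample from the second.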

Indeed, we can make this precise. Ask instead: “what is the probability, over the draw of the calibration set, that the algorithm returns a set that covers the future with probability at least 1-delta?” This is a different question, and it has a very precise answer. Vovk derives that the coverage conditional on the calibration set has a Beta(n+1-l, l) distribution, where l = floor(alpha*(n+1)), so this probability is equal to

P[ coverage >= 1-delta ] = betainc(delta; l, n+1-l)

Here betainc is the regularized incomplete beta function. Don’t worry too much about the details of that function.[2] Let me just tell you what this implies. I remarked that the conformal guarantees above looked like they were independent of n. With 2 data points, you get a 95% coverage guarantee. That seems too good to be true. Either that or the guarantee must be pretty vacuous. Vovk’s formula tells us a different story. If I really want to guarantee, with probability 99.999% over the draw of my calibration set, that the future is covered with probability between 94% and 96%, then I need at least 8,000 points in my calibration set. I’m going to repeat myself because it’s so surprising: you need nearly 10,000 independent and identically distributed samples to calibrate one prediction band.
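If you want to check this arithmetic yourself, here's a sketch of the calculation. It assumes my reading of Vovk's result above, that coverage conditional on the calibration set is Beta(n+1-l, l) distributed with l = floor(alpha*(n+1)); the step size of 1000 is just to keep the search coarse and fast.

```python
from math import floor
from scipy.stats import beta

def prob_coverage_in(n, alpha=0.05, lo=0.94, hi=0.96):
    # Vovk: coverage | calibration ~ Beta(n + 1 - l, l), l = floor(alpha * (n + 1))
    l = floor(alpha * (n + 1))
    dist = beta(n + 1 - l, l)
    return dist.cdf(hi) - dist.cdf(lo)

# smallest n, on a coarse grid, giving 99.999% confidence of 94-96% coverage
n = 1000
while prob_coverage_in(n) < 0.99999:
    n += 1000
print(n)  # on the order of 10,000, consistent with the claim in the text
```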

10,000 is a lot! And I can get you the same guarantee from the non-conformal algorithm I presented yesterday, based on the boring, 70-year-old DKW inequality. In many ways, using the DKW inequality gives even stronger guarantees. Conformal prediction needs 10,000 samples to calibrate a 95% prediction interval. The DKW inequality says that for 60,000 iid samples, you get precise coverage bands *for every alpha level* between 1% and 99%. If you happen to be one of these people with access to tens of thousands of iid samples, why not use a few more and get even better uncertainty quantification?
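For comparison, the DKW arithmetic fits in a few lines. The failure probability delta = 1e-5 below is my choice, made to match the 99.999% confidence level used above.

```python
from math import log, sqrt

def dkw_eps(n, delta):
    # DKW: sup_t |F_n(t) - F(t)| <= eps with probability >= 1 - delta,
    # where eps = sqrt(log(2 / delta) / (2 * n))
    return sqrt(log(2.0 / delta) / (2.0 * n))

eps = dkw_eps(60_000, 1e-5)
print(eps)  # about 0.01: every empirical quantile is within ~1% of the truth
```

Because the bound is uniform over the whole CDF, one calibration run hands you simultaneous bands at every alpha level, not just one.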

If you believe your data is iid, and you can get tens of thousands of data points, and you think your predictor is good, you can use conformal prediction to quantify prediction uncertainty. But the worst part is that I think conformal methods quantify uncertainty no one actually cares about.

From the comments and from Twitter and from talking with people IRL, I know that most people want a prediction band where no matter what x they plug in, C(x) likely contains y. This sort of coverage guarantee, where you get to pick the x, is called *conditional coverage*. Conformal prediction cannot give you this. Conformal coverage theorems only hold *if* (x,y) comes from the same distribution as the calibration set. Coverage where you’re just making predictions about the data streaming at you is called *marginal coverage*.
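Here's an illustrative simulation of the gap, with a toy construction of my own: heteroskedastic noise that grows with |x|, a point predictor that is exactly right, and one conformal width q shared by every x. Marginally you cover about 90% of the stream, but at the x's with the biggest noise, presumably the ones you care about, you cover far less.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw(size):
    x = rng.uniform(-1, 1, size)
    y = rng.normal(0, 0.1 + 0.9 * np.abs(x))  # noise grows with |x|
    return x, y

n, alpha = 2000, 0.1
x_cal, y_cal = draw(n)
scores = np.abs(y_cal)  # the point predictor f(x) = 0 is exactly right here
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

x_te, y_te = draw(100_000)
covered = np.abs(y_te) <= q
marginal = covered.mean()
tail = covered[np.abs(x_te) > 0.9].mean()
print(marginal)  # near 0.90: the guarantee holds on average over x
print(tail)      # well below 0.90: undercovered exactly where noise is biggest
```

Nothing went wrong with the conformal recipe here; marginal coverage is exactly what it promised, and exactly not what you wanted.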

I’ll talk about this more on Monday, but in most of the applications I’ve seen where you want prediction bands, you need conditional coverage. If you want to act to change the predicted outcome, marginal coverage can’t cut it. And there are strong theorems that tell us that it is impossible to have distribution-free conditional coverage. I said it already, and I’ll say it again: there is indeed no free lunch. If you want to make better predictions, you’ll have to build better models.

[1] F is a fudge factor slightly larger than 1; in the standard recipe, F = (n+1)/n.

[2] For my inequality grinders out there, this function is the cumulative distribution function of the binomial distribution. It has excellent Chernoff bound approximations in case you want to prove some more theory.

## Cover Songs

Predict_addict’s head must be exploding reading this.

> "Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!"

Could you clarify what the difference between these two events is? They sound exactly the same to me. Is the stochasticity due to different sources in the two cases? If so, what are those sources?