23 Comments
Tony Robbins's avatar

Predict_addict’s head must be exploding reading this.

Ben Recht's avatar

He didn't seem too happy with me.

Twitter remains a weird place.

Valeriy Manokhin's avatar

Anything of substance you can add? Guess not.

Hostile Replicator's avatar

Do you genuinely want a discussion/response? If so, why not post your critiques here, rather than as a series of Nassim Taleb-esque Twitter rants?

Valeriy Manokhin's avatar

Ironic comment from some anon calling himself “Hostile replicator” and considering that Twitter is where the OP posted his nonsense in the first place.

Alex Balinsky's avatar

Really enjoyed this post! Argmin is now a must-read for me!

Harris Papadopoulos's avatar

This post is far from being objective.

"First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis."

There are ways of verifying exchangeability. If someone uses a tool wrongly, that's not the tool's fault.

"Second, we should be worried that you get the same probabilistic guarantee regardless of the scoring function. Conformal prediction almost invites you to use garbage prediction functions."

You get the same guarantee, but the prediction regions are useless. Why should we be worried? One could say the same of all machine learning techniques: they provide you with predictions even if you use garbage data.

Adhyyan Narang's avatar

"Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!"

Could you clarify what the difference between these two events is? They sound exactly the same to me. Does the randomness come from different sources in the two cases? If so, what are they?

Ben Recht's avatar

Yes, this took me forever to figure out as well. Let me unpack it here:

Assume z is a random variable that takes values in Z. Let D be the distribution. For any subset S of Z, the coverage of S is just the measure of S with respect to D.

Desideratum 1: Fix alpha, delta, and epsilon. Design an algorithm that takes n samples from D and returns a set Sn. Guarantee that, with probability 1-delta over the draw of the first n samples, measure(Sn) is in [1-alpha-epsilon, 1-alpha+epsilon].

Desideratum 2: You now have n+1 samples. Design an algorithm that takes the first n samples and returns Sn. Guarantee that, with probability 1-alpha, z_{n+1} is in Sn. This probability is measured with respect to the full draw of n+1 samples.

Desideratum 2 blends the two probabilities in Desideratum 1: the probability over the n samples that produce Sn, and the measure of Sn itself. Does that make sense?
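
To make the distinction between the two desiderata concrete, here is a minimal simulation sketch. It assumes (this is not from the thread) that the nonconformity scores are i.i.d. Uniform(0,1), so that the coverage of the set {s : s <= q} is simply q, and it uses the usual split-conformal threshold, the ceil((n+1)(1-alpha))-th smallest calibration score. The function names and parameter values are illustrative only.

```python
import math
import random


def conformal_threshold(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: the set must be everything.
        return math.inf
    return sorted(scores)[k - 1]


def simulate(n, alpha, eps, trials, seed=0):
    """Return (average coverage, fraction of calibration draws whose
    coverage lies within eps of 1 - alpha)."""
    rng = random.Random(seed)
    coverages = []
    for _ in range(trials):
        calib = [rng.random() for _ in range(n)]
        q = conformal_threshold(calib, alpha)
        # ASSUMPTION: scores are Uniform(0,1), so P(new score <= q) = min(q, 1).
        coverages.append(min(q, 1.0))
    marginal = sum(coverages) / trials
    within = sum(abs(c - (1 - alpha)) <= eps for c in coverages) / trials
    return marginal, within


marginal, within = simulate(n=100, alpha=0.1, eps=0.01, trials=2000)
print(f"coverage averaged over calibration draws: {marginal:.3f}")
print(f"fraction of draws with coverage within 0.01 of 0.90: {within:.3f}")
```

Averaged over calibration draws, the coverage sits near the target, which is the kind of statement Desideratum 2 makes. With n = 100, though, any single calibration set can produce a set whose coverage is noticeably off from 1-alpha, and that spread is what Desideratum 1 constrains.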

Adhyyan Narang's avatar

Yes, I see the difference. So you're saying that CP ensures desideratum 2 and not 1. I'm still trying to wrap my head around why #2 is much easier than #1.

Why is it possible to satisfy #2 with just 2 held-out examples but not #1?

Ben Recht's avatar

It's not that you can't achieve both, it's just that the guarantees look different. Take the standard conformal method. Most proofs guarantee

1-alpha <= Pr[y in C(x)] <= 1-alpha + 1/(n+1)

But Vovk shows the same algorithm gets the guarantee "with probability 1-delta, measure(Sn) is p" where p is that incomplete beta expression I provide in the post.
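
The "with probability 1-delta" style of statement can be evaluated with the Python standard library alone. The sketch below assumes i.i.d. data with continuous scores, in which case the coverage of the split-conformal set, conditional on the calibration data, follows a Beta(k, n+1-k) distribution with k = ceil((n+1)(1-alpha)). It uses the identity P(Beta(k, n+1-k) <= p) = P(Binomial(n, p) >= k). The function names are hypothetical, and this is not claimed to be the exact expression from the post.

```python
import math


def binom_tail_ge(n, p, k):
    """P(Binomial(n, p) >= k) for 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for j in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def coverage_cdf(n, alpha, p):
    """P(coverage of S_n <= p), assuming coverage ~ Beta(k, n+1-k)."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    k = math.ceil((n + 1) * (1 - alpha))
    return binom_tail_ge(n, p, k)


def prob_coverage_within(n, alpha, eps):
    """Probability, over the calibration draw, that coverage lies in
    [1 - alpha - eps, 1 - alpha + eps]."""
    lo, hi = 1 - alpha - eps, 1 - alpha + eps
    return coverage_cdf(n, alpha, hi) - coverage_cdf(n, alpha, lo)


for n in (100, 1000, 10000):
    print(n, round(prob_coverage_within(n, alpha=0.1, eps=0.01), 4))
```

The printed probabilities rise with n: the marginal guarantee holds for any calibration size, while the probability that a given calibration set lands within epsilon of the target depends strongly on n.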

Adhyyan Narang's avatar

Is it because in the latter you are in a "transductive" case where you can tailor your prediction set to the unlabeled example, but the former case is more "inductive" and needs to work for any unlabeled example?

Zach's avatar

"Conformal prediction almost invites you to use garbage prediction functions. You’ll get the same coverage guarantees for a carefully fine-tuned transformer model as you’ll get for a literally random function."

This is true, but your prediction intervals will vary in size depending on the performance of the underlying model.

Ben Recht's avatar

We all hope this is true, but coverage guarantees give no guarantees of interval size.

Valeriy Manokhin's avatar

This is false because there is research showing how to estimate interval size, but again you are so clueless about conformal prediction that you are just throwing empty claims.

Alex Balinsky's avatar

Valeriy, the person who is throwing empty claims here is you. It looks like you are clueless about what scientific discussion is.

Zach's avatar

Sure, but you need one free variable. You can either hold coverage frequency or interval size constant. People seem to get more value out of constant coverage frequency than interval size.

Adhyyan Narang's avatar

This paper by Lei et al. says some things about the prediction widths. https://arxiv.org/abs/1604.04173

But for some reason the subsequent literature focuses exclusively on coverage guarantees and not on width guarantees. I'm not sure what the reason for this is. If it's that providing width guarantees is much harder, I'd be curious about the intuition for why.
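
The exchange above about coverage versus interval size can be illustrated with a small sketch. It assumes a made-up toy regression problem (y = 2x plus Gaussian noise), an absolute-residual score, and two hypothetical predictors, one accurate and one that ignores x entirely. None of these choices come from the thread. The point is only to show split conformal producing similar empirical coverage with very different interval widths.

```python
import math
import random


def split_conformal_halfwidth(predict, calib, alpha):
    """Half-width of the interval predict(x) +/- h, where h is the
    ceil((n+1)(1-alpha))-th smallest absolute residual on calib."""
    scores = sorted(abs(y - predict(x)) for x, y in calib)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else math.inf


def draw(rng, m):
    """ASSUMPTION: toy data, y = 2x + N(0, 0.1^2), x ~ Uniform(-1, 1)."""
    data = []
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return data


def evaluate(predict, calib, test, alpha):
    """Return (half-width, empirical coverage on test)."""
    h = split_conformal_halfwidth(predict, calib, alpha)
    covered = sum(abs(y - predict(x)) <= h for x, y in test)
    return h, covered / len(test)


rng = random.Random(0)
calib, test = draw(rng, 1000), draw(rng, 5000)

predictors = {
    "good": lambda x: 2.0 * x,  # matches the true regression function
    "garbage": lambda x: 0.0,   # ignores the input entirely
}
results = {name: evaluate(f, calib, test, alpha=0.1) for name, f in predictors.items()}
for name, (h, cov) in results.items():
    print(f"{name}: half-width={h:.3f}, empirical coverage={cov:.3f}")
```

In this toy setup both predictors land near 90% coverage, but the intervals around the constant predictor are far wider. This is one example, not a width guarantee: the code demonstrates what happens here and proves nothing about interval size in general.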

Valeriy Manokhin's avatar

Your "Cover Songs" post is well-written, but the real scam isn't conformal prediction — it's dressing up known trade-offs as a grand exposé of "mathematical deceit."

The algorithm is simple: a quantile on calibration scores. That's not hidden or scandalous; it's the entire point. The value lies in the clean finite-sample marginal coverage guarantee (P(Y_test ∈ C(X_test)) ≥ 1-α) that holds under exchangeability without assuming your model is correct or the data follows any particular distribution. Calling this elegant result "deceit" or "cosplay" is theatrical overkill.

Exchangeability is the same minimal assumption behind virtually every confidence interval and statistical procedure that reuses past data for future inference. Labeling it "fanciful" and "unverifiable" as if it's uniquely damning for CP is misleading. In practice people check for obvious violations with diagnostics, and extensions (weighted CP, adaptive methods) exist for shift. The flat-Earth analogy is cute but doesn't survive contact with how statistics actually works.

The coverage critique is the slipperiest part. Standard CP delivers exactly the marginal guarantee it promises, for any calibration size. Demanding instead a strong training-conditional guarantee with 99.999% probability over the calibration set (±1% deviation) and then declaring failure when it needs ~10k points is classic goalpost-moving. The literature has always been clear about marginal vs. conditional coverage and about how concentration improves with n. Marginal coverage is still practically valuable for safety layers, reliability statements, and applications where you want guaranteed validity even if the underlying model is mediocre.

Complaining that "it invites garbage scores" misses the separation of validity and efficiency that every CP reference emphasizes. Bad scores give conservative sets; good ones (normalized residuals, learned nonconformity, etc.) give tight adaptive sets. The guarantee holding anyway is a feature for safety-critical work.

Yes, marginal isn't conditional, distribution shift breaks things, and some hype oversells it. Those are fair points worth discussing. But turning them into a takedown claiming the whole thing is vacuous "cover songs" does more for engagement than accuracy.

Conformal prediction isn't revolutionary magic. It's a practical, rigorously grounded tool for adding finite-sample validity to black-box predictions. It has limits, like every method. It is not a scam.

The field benefits from honest critique of limitations. It benefits less from dramatic obituaries that overreach.

We'll keep using (and extending) CP where it adds value. Perhaps next time engage the actual theorems and community responses instead of performing the “lone voice” drama against the robust statistical framework.

Valeriy Manokhin's avatar

commenting on 'The third reason the conformal guarantees are misleading is that they are conflating the probability the algorithm is correct with the probability the prediction is correct. Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%.'

This misunderstanding largely comes from low-quality Berkeley tutorials, not from the core literature.

In fact, Vovk’s original paper makes it clear that the results based on the beta distribution are correct.

The claim that 500 points in the calibration set are sufficient only appears in a Berkeley tutorial—and it is false.

No serious textbook or research paper on conformal prediction repeats this claim.

That reflects more on the quality of Berkeley’s tutorial than on conformal prediction itself.

And DKW does not provide better bounds either: https://medium.com/@peterzwart_44168/finite-sample-guarantees-in-conformal-prediction-from-dkw-to-beta-binomial-577228b313fe

Valeriy Manokhin's avatar

'First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis. I can also derive rigorous theorems about mechanics, assuming the Earth is flat and the center of the universe. There are diminishing returns of rigor when the foundational assumptions are untestable and implausible.' This did not age well, as exchangeability is not required for conformal prediction.

https://arxiv.org/abs/2405.06627

Matteo Fontana's avatar

Very interesting post, Ben.

I do not share your concerns with respect to the centrality of the IID assumption. I'd say that having data that are at least conditionally independent and identically distributed is a prerequisite for any kind of empirical modelling.

Empirical work, and with it the inductive reasoning you use when you build a quantitative model to predict something, requires some regularities in the data-generating process (DGP) that you can exploit to produce forecasts.

In the easiest cases you have "pure" iid-ness; if you need to add "structure" to your modelling, that's where the conditional part comes in.

I get your point about "proving the wrong theorems", but I do not think I share it. In fact, I think it is one of the most interesting selling points of CP. At the same time, I agree with you that some additional work on the size guarantees one gets when using CP is in order. My feeling (but I really should study a bit more... :) ) is that exploring the intersection between scoring rules à la Gneiting (either univariate or multivariate) and nonconformity measures (NCMs) could be an interesting piece of research.

The part I share, and find quite fascinating about your post (and which, to be fair, I did not grasp after reading Vovk's paper), is that some parts of CP are apparently, for lack of a better word, "asymptotic". This hinders the use of CP in my own use cases (and should be raised as a caveat), where CP is used to produce prediction regions with coverage guarantees using simple "statistical" models on high-dimensional or complex outputs.

I believe this latter point to be very interesting in general from a theoretical perspective. But having seen where part of CP research is going (namely, providing UQ for very complex models in a big-data setting; see e.g. all the work by the Stanford guys), it is probably a bit orthogonal to what current research is doing.

In any case, happy to discuss and share views... or alternative approaches, since non-parametric methodologies for distributional prediction are not easy to find out there.