20 Comments
Tony Robbins:

Predict_addict’s head must be exploding reading this.

Ben Recht:

He didn't seem too happy with me.

Twitter remains a weird place.

Valeriy Manokhin:

Anything of substance you can add? Guess not.

Hostile Replicator:

Do you genuinely want a discussion/response? If so, why not post your critiques here, rather than as a series of Nassim Taleb-esque Twitter rants?

Valeriy Manokhin:

Ironic comment from an anon calling himself “Hostile Replicator,” considering that Twitter is where the OP posted his nonsense in the first place.

Alex Balinsky:

Really enjoyed this post! Argmin is now my must-read!

Ben Recht:

Thank you!

Harris Papadopoulos:

This post is far from objective.

"First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis."

There are ways of verifying exchangeability. If someone uses a tool wrongly, that's not the tool's fault.
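
For instance, here is a crude, purely illustrative sketch of one such check (my own toy example, not from the post): under exchangeability every ordering of the scores is equally likely, so a trend statistic computed on the observed ordering should not look extreme against its permutation distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def exchangeability_pvalue(x, n_perm=2000):
    """Permutation check: under exchangeability, the correlation between the
    values and their time index should not be extreme relative to reshuffles."""
    t = np.arange(len(x))
    observed = abs(np.corrcoef(t, x)[0, 1])
    perm = np.array([abs(np.corrcoef(t, rng.permutation(x))[0, 1])
                     for _ in range(n_perm)])
    return (1 + np.sum(perm >= observed)) / (1 + n_perm)

print(exchangeability_pvalue(rng.standard_normal(200)))                           # i.i.d. data: large p-value
print(exchangeability_pvalue(rng.standard_normal(200) + 0.02 * np.arange(200)))   # drifting data: small p-value
```

(A check like this can only flag particular violations, such as drift; it cannot certify exchangeability.)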

"Second, we should be worried that you get the same probabilistic guarantee regardless of the scoring function. Conformal prediction almost invites you to use garbage prediction functions."

You get the same guarantee, but the prediction regions will be useless. Why should we be worried? One could say the same for all Machine Learning techniques: they provide you with predictions even if you use garbage data.

Adhyyan Narang:

"Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!"

Could you clarify what the difference between these two events is? They sound exactly the same to me. Is the stochasticity due to different sources in the two cases? If so, what are they?

Ben Recht:

Yes, this took me forever to figure out as well. Let me unpack it here:

Assume z is a random variable that takes values in Z. Let D be the distribution. For any subset S of Z, the coverage of S is just the measure of S with respect to D.

Desideratum 1: Fix alpha, delta, and epsilon. Design an algorithm that takes n samples from D and returns a set S_n. Guarantee that, with probability 1-delta over the n samples, measure(S_n) is in [1-alpha-epsilon, 1-alpha+epsilon].

Desideratum 2: You now have n+1 samples. Design an algorithm that takes the first n samples and returns S_n, and guarantee that, with probability 1-alpha, z_{n+1} is in S_n. This probability is measured with respect to the full draw of all n+1 samples.

Desideratum 2 blends the two probabilities in Desideratum 1. Does that make sense?
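
Here is a toy simulation of the difference (my own illustrative setup: standard normal scores, so measure(S_n) can be computed exactly):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, alpha, trials = 20, 0.1, 10_000
k = int(np.ceil((n + 1) * (1 - alpha)))        # rank of the split-conformal threshold

coverage_of_Sn = np.empty(trials)
for t in range(trials):
    scores = rng.standard_normal(n)            # n calibration samples from D
    q = np.sort(scores)[k - 1]                 # S_n = {z : score(z) <= q}
    coverage_of_Sn[t] = norm.cdf(q)            # measure(S_n), the quantity in Desideratum 1

print("marginal coverage:", coverage_of_Sn.mean())                            # ~ 1 - alpha (Desideratum 2)
print("5%/95% of measure(S_n):", np.quantile(coverage_of_Sn, [0.05, 0.95]))   # wide spread for small n
```

The average is pinned near 1 - alpha, but any particular S_n can have coverage well below (or above) that. Desideratum 1 asks you to control that spread with high probability, which is a different statement.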

Adhyyan Narang:

Yes, I see the difference. So you're saying that CP ensures Desideratum 2 and not 1. I'm still trying to wrap my head around why #2 is much easier than #1.

Why is it possible to satisfy #2 with just 2 held-out examples but not #1?

Ben Recht:

It's not that you can't achieve both; it's just that the guarantees look different. Take the standard conformal method. Most proofs guarantee

1 - alpha <= Pr[ y in C(x) ] <= 1 - alpha + 1/(n+1)

But Vovk shows the same algorithm gets the guarantee "with probability 1-delta, measure(S_n) is p," where p is that incomplete beta expression I provide in the post.
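
Concretely (and assuming I have the bookkeeping right; this is for continuous scores): measure(S_n) is distributed like the k-th order statistic of n uniforms, i.e. Beta(k, n+1-k), whose CDF is a regularized incomplete beta function. A quick sanity check:

```python
import numpy as np
from scipy.stats import beta

n, alpha = 20, 0.1
k = int(np.ceil((n + 1) * (1 - alpha)))    # split-conformal threshold rank
cov = beta(k, n + 1 - k)                   # distribution of measure(S_n) for continuous scores
print(cov.mean())                          # k/(n+1), which sits inside the marginal bound above
print(cov.ppf([0.05, 0.5, 0.95]))          # the 5% quantile is well below 1 - alpha when n = 20
```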

Adhyyan Narang:

Is it because in the latter you are in a "transductive" case where you can tailor your prediction set to the unlabeled example, but the former case is more "inductive" and needs to work for any unlabeled example?

Zach:

"Conformal prediction almost invites you to use garbage prediction functions. You’ll get the same coverage guarantees for a carefully fine-tuned transformer model as you’ll get for a literally random function."

This is true, but your prediction intervals will vary in size depending on the performance of the underlying model.

Ben Recht:

We all hope this is true, but coverage guarantees give no guarantees of interval size.
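
Here's a toy illustration of both points (a made-up regression problem, nothing from the post): a good point predictor and a useless one both land near 90% coverage, but nothing in the coverage statement itself constrains the width you end up with.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_cal, n_test = 0.1, 500, 5000

def coverage_and_width(predict):
    # Split conformal with absolute-residual scores for a given point predictor.
    x_cal, x_test = rng.uniform(-3, 3, n_cal), rng.uniform(-3, 3, n_test)
    y_cal = np.sin(x_cal) + 0.1 * rng.standard_normal(n_cal)
    y_test = np.sin(x_test) + 0.1 * rng.standard_normal(n_test)
    scores = np.abs(y_cal - predict(x_cal))
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q = np.sort(scores)[k - 1]                     # conformal radius
    covered = np.abs(y_test - predict(x_test)) <= q
    return covered.mean(), 2 * q                   # empirical coverage, interval width

print(coverage_and_width(np.sin))              # good model: ~0.90 coverage, narrow intervals
print(coverage_and_width(lambda x: 0.0 * x))   # garbage model: still ~0.90 coverage, much wider intervals
```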

Valeriy Manokhin:

This is false because there is research showing how to estimate interval size, but again you are so clueless about conformal prediction that you are just throwing around empty claims.

Alex Balinsky:

Valeriy, the person who is throwing empty claims here is you. It looks like you are so clueless about what scientific discussion is.

Zach:

Sure, but you need one free variable. You can either hold coverage frequency constant or hold interval size constant. People seem to get more value out of constant coverage frequency than out of constant interval size.

Adhyyan Narang:

This paper by Lei et al. says some things about the prediction widths: https://arxiv.org/abs/1604.04173

But for some reason the later literature focuses exclusively on coverage guarantees and not on width guarantees. I'm not sure what the reason for this is. If it's that providing width guarantees is much harder, I'd be curious about the intuition for why.

Matteo Fontana:

Very interesting post, Ben.

I do not share your concerns with respect to the centrality of the IID assumption. I'd say that having data that are at least conditionally independent and identically distributed is a prerequisite for any kind of empirical modelling.

Empirics, and thus the inductive way of reasoning you use when you build a quantitative model to predict something, requires some regularities in the data-generating process (DGP) that you can exploit to produce these forecasts.

In the easiest cases you have "pure" IID-ness; if you need to add "structure" to your modelling, that's where the conditional part comes in.

I get your point about "proving the wrong theorems", but I do not think I share it. In fact, I think it is one of the most interesting selling points of CP. At the same time, I agree with you that some additional work on the size guarantees one gets when using CP is in order. My feeling (but I really should study a bit more... :) ) is that exploring the intersection between scoring rules à la Gneiting (either univariate or multivariate) and NCMs could be an interesting piece of research.

The part I share, and find quite fascinating about your post (and, to be fair, did not grasp after reading Vovk's paper), is that apparently there are some parts of CP that are, for lack of a better word, "asymptotic". This hinders (and should be raised as a caveat) the use of CP in my own use cases, where CP is used to produce prediction regions with coverage guarantees from simple "statistical" models applied to high-dimensional/complex outputs.

I believe this latter point is very interesting from a theoretical perspective, but having seen where part of CP research is going (which is, roughly, providing UQ for very complex models in a big-data setting; see e.g. all the work by the Stanford guys), it is probably a bit orthogonal to what current research is doing.

In any case, I'm happy to discuss and share views, or alternative approaches, since it is not super easy to find non-parametric methodologies for distributional prediction out there.
