Predict_addict’s head must be exploding reading this.
He didn't seem too happy with me.
Twitter remains a weird place.
Anything of substance you can add? Guess not.
Do you genuinely want a discussion/response? If so, why not post your critiques here, rather than as a series of Nassim Taleb-esque Twitter rants?
Ironic comment from some anon calling himself “Hostile replicator”, considering that Twitter is where the OP posted his nonsense in the first place.
Really enjoyed this post! Argmin is now my must-read!
Thank you!
This post is far from being objective.
"First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis."
There are ways of verifying exchangeability. If someone uses a tool wrongly, that's not the tool's fault.
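For example, here is one quick check, sketched only (a permutation test against a trend alternative; it detects that particular departure from exchangeability, not every possible one). Under exchangeability every reordering of the sample is equally likely, so an order-sensitive statistic should look unremarkable among its permuted versions:

```python
# Sketch of one possible check (illustrative only): a permutation test of
# exchangeability against an ordering-sensitive alternative (a trend).
import numpy as np

rng = np.random.default_rng(0)

def exchangeability_pvalue(z, n_perm=2000):
    """Permutation p-value for |corr(time index, z)|."""
    t = np.arange(len(z))
    observed = abs(np.corrcoef(t, z)[0, 1])
    perm = np.array([abs(np.corrcoef(t, rng.permutation(z))[0, 1])
                     for _ in range(n_perm)])
    # +1 in numerator and denominator gives a valid finite-sample p-value.
    return (1 + np.sum(perm >= observed)) / (n_perm + 1)

iid_sample = rng.normal(size=200)                         # plausibly exchangeable
drifting = rng.normal(size=200) + 0.02 * np.arange(200)   # mean drifts over time

print(exchangeability_pvalue(iid_sample))  # usually large: no evidence of drift
print(exchangeability_pvalue(drifting))    # usually tiny: exchangeability rejected
```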
"Second, we should be worried that you get the same probabilistic guarantee regardless of the scoring function. Conformal prediction almost invites you to use garbage prediction functions."
You get the same guarantee, but the prediction regions are useless. Why should we be worried? One could say the same for all Machine Learning techniques: they provide you with predictions even if you use garbage data.
"Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!"
Could you clarify what the difference between these two events is? They sound exactly the same to me. Is the stochasticity due to different sources in the two cases - if so, what are these?
Yes, this took me forever to figure out as well. Let me unpack it here:
Assume z is a random variable that takes values in Z. Let D be the distribution. For any subset S of Z, the coverage of S is just the measure of S with respect to D.
Desideratum 1: Fix alpha, delta, and epsilon. Design an algorithm that takes n samples from D and returns Sn. Guarantee that, with probability 1-delta over the n samples, measure(Sn) is in [1-alpha-epsilon, 1-alpha+epsilon].
Desideratum 2: You now have n+1 samples. Design an algorithm that takes the first n samples and returns Sn, and guarantee that, with probability 1-alpha, z_{n+1} is in Sn. This probability is measured with respect to the full draw of n+1 samples.
Desideratum 2 blends the two probabilities in Desideratum 1. Does that make sense?
Yes, I see the difference. So you're saying that CP ensures desideratum 2 and not 1. I'm still trying to wrap my head around why #2 is much easier than #1.
Why is it possible to satisfy #2 with just 2 held-out examples but not #1?
It's not that you can't achieve both, it's just that the guarantees look different. Take the standard conformal method. Most proofs guarantee
1 - alpha <= Pr[y in C(x)] <= 1 - alpha + 1/(n+1)
But Vovk shows the same algorithm gets the guarantee "with probability 1-delta, measure(Sn) is p" where p is that incomplete beta expression I provide in the post.
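Here's a quick simulation sketch of the gap between the two statements (my own toy example, not from the post: Uniform(0,1) scores stand in for generic continuous nonconformity scores, so the coverage of each Sn can be computed exactly):

```python
# Toy sketch of split conformal with Uniform(0,1) scores. S_n = {z : score(z) <= q_hat},
# where q_hat is the ceil((n+1)(1-alpha))-th smallest calibration score, so
# measure(S_n) = q_hat exactly.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, trials = 100, 0.1, 20_000
k = int(np.ceil((n + 1) * (1 - alpha)))    # rank of the conformal threshold

coverage_of_Sn = np.empty(trials)          # Desideratum-1 quantity: measure(S_n)
hit = np.empty(trials, dtype=bool)         # Desideratum-2 event: z_{n+1} in S_n

for i in range(trials):
    cal = np.sort(rng.uniform(size=n))     # n calibration scores
    q_hat = cal[k - 1]
    coverage_of_Sn[i] = q_hat              # exact coverage of this particular S_n
    hit[i] = rng.uniform() <= q_hat        # fresh test point z_{n+1}

print("marginal P(z_{n+1} in S_n):", hit.mean())   # lands in [1-alpha, 1-alpha+1/(n+1)]
print("average measure(S_n):", coverage_of_Sn.mean())
print("5th percentile of measure(S_n):", np.quantile(coverage_of_Sn, 0.05))
# measure(S_n) is Beta(k, n+1-k)-distributed here: it fluctuates around 1-alpha
# from calibration draw to calibration draw.
```

The marginal number is the standard conformal guarantee; the spread in measure(Sn) across calibration draws is what the incomplete-beta statement is about, and that spread is exactly the gap between the two desiderata.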
Is it because in the latter you are in a "transductive" case where you can tailor your prediction set to the unlabeled example, but the former case is more "inductive" and needs to work for any unlabeled example?
"Conformal prediction almost invites you to use garbage prediction functions. You’ll get the same coverage guarantees for a carefully fine-tuned transformer model as you’ll get for a literally random function."
This is true, but your prediction intervals will vary in size depending on the performance of the underlying model.
We all hope this is true, but coverage guarantees give no guarantees of interval size.
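To make that concrete, here's a toy sketch (my own, not from the post): split conformal wrapped around a reasonable least-squares fit versus a useless constant predictor. Both hit roughly 90% coverage; only the widths tell them apart.

```python
# Toy sketch: split conformal around a sensible model vs. a garbage one.
# Coverage is ~1-alpha either way; the interval width reflects model quality.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

def make_data(m):
    x = rng.uniform(-3, 3, size=m)
    return x, 2.0 * x + rng.normal(scale=0.5, size=m)

x_tr, y_tr = make_data(2000)
x_cal, y_cal = make_data(1000)
x_te, y_te = make_data(5000)

slope = np.sum(x_tr * y_tr) / np.sum(x_tr * x_tr)   # least-squares fit (no intercept)
predictors = {
    "least squares": lambda x: slope * x,
    "garbage (always 0)": lambda x: np.zeros_like(x),
}

for name, f in predictors.items():
    scores = np.abs(y_cal - f(x_cal))                 # nonconformity scores
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]                    # conformal radius
    covered = np.abs(y_te - f(x_te)) <= q_hat
    print(f"{name:20s} coverage = {covered.mean():.3f}, width = {2 * q_hat:.2f}")
# Expected output: both coverages near 0.90, but the garbage model's intervals
# are several times wider.
```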
This is false because there is research showing how to estimate interval size, but again you are so clueless about conformal prediction that you are just throwing empty claims.
Valery, the person who is throwing empty claims here is you. It looks like you are clueless about what a scientific discussion is.
Sure, but you need one free variable. You can either hold coverage frequency or interval size constant. People seem to get more value out of constant coverage frequency than interval size.
This paper by Lei et al. says some things about prediction widths: https://arxiv.org/abs/1604.04173
But for some reason the subsequent literature focuses exclusively on coverage guarantees and not on width guarantees. I'm not sure what the reason for this is - if it's that providing width guarantees is much harder, I'd be curious about the intuition for why.
Very interesting post, Ben.
I do not share your concerns with respect to the centrality of the IID assumption. I'd say that having data that are at least conditionally independent and identically distributed is a prerequisite for any kind of empirical modelling.
Empirics, and hence the inductive reasoning you use when you build a quantitative model to predict something, requires some regularity in the data-generating process (DGP) that you can exploit to produce these forecasts.
In the easiest cases you have "pure" IID-ness; if you need to add "structure" to your modelling, that's where the conditional part comes in.
I get your point about "proving the wrong theorems", but I do not think I share it. In fact, I think it is one of the most interesting selling points of CP. At the same time, I agree with you that some additional work on the size guarantees one gets when using CP would be in order. My feeling (but I really should study a bit more... :) ) is that exploring the intersection between scoring rules à la Gneiting (either univariate or multivariate) and NCMs could be an interesting piece of research.
The part I do share, and find quite fascinating about your post (and, to be fair, did not grasp after reading Vovk's paper), is that apparently some parts of CP are, for lack of a better word, "asymptotic". This hinders (and should be raised as a caveat) the use of CP in my own use cases, where CP is used to produce prediction regions with coverage guarantees using simple, "statistical" models on high-dimensional/complex outputs.
I believe this latter point to be very interesting from a theoretical perspective, but, having seen where part of CP research is going (namely, providing UQ for very complex models in a big-data setting; see e.g. all the work by the Stanford guys), it is probably a bit orthogonal to what current research is doing.
In any case, I'm happy to discuss and share views, or alternative approaches, since it is not easy to find non-parametric methodologies for distributional prediction out there.