Predict_addict’s head must be exploding reading this.
He didn't seem too happy with me.
Twitter remains a weird place.
Anything of substance you can add? Guess not.
Do you genuinely want a discussion/response? If so, why not post your critiques here, rather than as a series of Nassim Taleb-esque Twitter rants?
Ironic comment from some anon calling himself “Hostile replicator”, considering that Twitter is where the OP posted his nonsense in the first place.
Really enjoyed this post! Argmin is now my must-read!
Thank you!
This post is far from being objective.
"First and foremost, all of the beautiful rigor relies on the assumption that past data is exchangeable with the future. We are drawing logical conclusions under a fanciful, unverifiable hypothesis."
There are ways of verifying exchangeability. If someone uses a tool wrongly, that's not the tool's fault.
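For example, here is one quick check, sketched only (a permutation test against a trend alternative; it detects that particular departure from exchangeability, not every possible one). Under exchangeability every reordering of the sample is equally likely, so an order-sensitive statistic should look unremarkable among its permuted versions:

```python
# Sketch of one possible check (illustrative only): a permutation test of
# exchangeability against an ordering-sensitive alternative (a trend).
import numpy as np

rng = np.random.default_rng(0)

def exchangeability_pvalue(z, n_perm=2000):
    """Permutation p-value for |corr(time index, z)|."""
    t = np.arange(len(z))
    observed = abs(np.corrcoef(t, z)[0, 1])
    perm = np.array([abs(np.corrcoef(t, rng.permutation(z))[0, 1])
                     for _ in range(n_perm)])
    # +1 in numerator and denominator gives a valid finite-sample p-value.
    return (1 + np.sum(perm >= observed)) / (n_perm + 1)

iid_sample = rng.normal(size=200)                         # plausibly exchangeable
drifting = rng.normal(size=200) + 0.02 * np.arange(200)   # mean drifts over time

print(exchangeability_pvalue(iid_sample))  # usually large: no evidence of drift
print(exchangeability_pvalue(drifting))    # usually tiny: exchangeability rejected
```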
"Second, we should be worried that you get the same probabilistic guarantee regardless of the scoring function. Conformal prediction almost invites you to use garbage prediction functions."
You get the same guarantee, but the prediction regions are useless. Why should we be worried? One could say the same for all Machine Learning techniques: they provide you with predictions even if you use garbage data.
"Conformal prediction does not tell you the probability the future is in the prediction set is 95%. It tells you that the probability that the algorithm returns a prediction set containing the future is 95%. These Monte Carlo guarantees are very different!"
Could you clarify what the difference between these two events is? They sound exactly the same to me. Is the stochasticity due to different sources in the two cases - if so, what are these?
Yes, this took me forever to figure out as well. Let me unpack it here:
Assume z is a random variable that takes values in Z. Let D be the distribution. For any subset S of Z, the coverage of S is just the measure of S with respect to D.
Desideratum 1: Fix alpha, delta, and epsilon. Design an algorithm that takes n samples from D and returns Sn. Guarantee that, with probability 1-delta over the n samples, measure(Sn) is in [1-alpha-epsilon, 1-alpha+epsilon].
Desideratum 2: You now have n+1 samples. Design an algorithm that takes the first n samples and returns Sn, and guarantee that, with probability 1-alpha, z_{n+1} is in Sn. This probability is measured with respect to the full draw of n+1 samples.
Desideratum 2 blends the two probabilities in Desideratum 1. Does that make sense?
Yes, I see the difference. So you're saying that CP ensures desideratum 2 and not 1. I'm still trying to wrap my head around why #2 is much easier than #1.
Why is it possible to satisfy #2 with just 2 held-out examples but not #1?
It's not that you can't achieve both, it's just that the guarantees look different. Take the standard conformal method. Most proofs guarantee
1 - alpha <= Pr[y in C(x)] <= 1 - alpha + 1/(n+1)
But Vovk shows the same algorithm gets the guarantee "with probability 1-delta, measure(Sn) is p" where p is that incomplete beta expression I provide in the post.
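Here's a quick simulation sketch of the gap between the two statements (my own toy example, not from the post: Uniform(0,1) scores stand in for generic continuous nonconformity scores, so the coverage of each Sn can be computed exactly):

```python
# Toy sketch of split conformal with Uniform(0,1) scores. S_n = {z : score(z) <= q_hat},
# where q_hat is the ceil((n+1)(1-alpha))-th smallest calibration score, so
# measure(S_n) = q_hat exactly.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, trials = 100, 0.1, 20_000
k = int(np.ceil((n + 1) * (1 - alpha)))    # rank of the conformal threshold

coverage_of_Sn = np.empty(trials)          # Desideratum-1 quantity: measure(S_n)
hit = np.empty(trials, dtype=bool)         # Desideratum-2 event: z_{n+1} in S_n

for i in range(trials):
    cal = np.sort(rng.uniform(size=n))     # n calibration scores
    q_hat = cal[k - 1]
    coverage_of_Sn[i] = q_hat              # exact coverage of this particular S_n
    hit[i] = rng.uniform() <= q_hat        # fresh test point z_{n+1}

print("marginal P(z_{n+1} in S_n):", hit.mean())   # lands in [1-alpha, 1-alpha+1/(n+1)]
print("average measure(S_n):", coverage_of_Sn.mean())
print("5th percentile of measure(S_n):", np.quantile(coverage_of_Sn, 0.05))
# measure(S_n) is Beta(k, n+1-k)-distributed here: it fluctuates around 1-alpha
# from calibration draw to calibration draw.
```

The marginal number is the standard conformal guarantee; the spread in measure(Sn) across calibration draws is what the incomplete-beta statement is about, and that spread is exactly the gap between the two desiderata.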
Is it because in the latter you are in a "transductive" case where you can tailor your prediction set to the unlabeled example, but the former case is more "inductive" and needs to work for any unlabeled example?
"Conformal prediction almost invites you to use garbage prediction functions. You’ll get the same coverage guarantees for a carefully fine-tuned transformer model as you’ll get for a literally random function."
This is true, but your prediction intervals will vary in size depending on the performance of the underlying model.
We all hope this is true, but coverage guarantees give no guarantees of interval size.
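To make that concrete, here's a toy sketch (my own, not from the post): split conformal wrapped around a reasonable least-squares fit versus a useless constant predictor. Both hit roughly 90% coverage; only the widths tell them apart.

```python
# Toy sketch: split conformal around a sensible model vs. a garbage one.
# Coverage is ~1-alpha either way; the interval width reflects model quality.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

def make_data(m):
    x = rng.uniform(-3, 3, size=m)
    return x, 2.0 * x + rng.normal(scale=0.5, size=m)

x_tr, y_tr = make_data(2000)
x_cal, y_cal = make_data(1000)
x_te, y_te = make_data(5000)

slope = np.sum(x_tr * y_tr) / np.sum(x_tr * x_tr)   # least-squares fit (no intercept)
predictors = {
    "least squares": lambda x: slope * x,
    "garbage (always 0)": lambda x: np.zeros_like(x),
}

for name, f in predictors.items():
    scores = np.abs(y_cal - f(x_cal))                 # nonconformity scores
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]                    # conformal radius
    covered = np.abs(y_te - f(x_te)) <= q_hat
    print(f"{name:20s} coverage = {covered.mean():.3f}, width = {2 * q_hat:.2f}")
# Expected output: both coverages near 0.90, but the garbage model's intervals
# are several times wider.
```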
This is false because there is research showing how to estimate interval size, but again you are so clueless about conformal prediction that you are just throwing empty claims.
Valery, the person who is throwing empty claims here is you. It looks like you are clueless about what a scientific discussion is.
Sure, but you need one free variable. You can either hold coverage frequency or interval size constant. People seem to get more value out of constant coverage frequency than interval size.
This paper by Lei et al. says some things about prediction widths: https://arxiv.org/abs/1604.04173
But for some reason the subsequent literature focuses exclusively on coverage guarantees and not on width guarantees. I'm not sure what the reason for this is - if it's that providing width guarantees is much harder, I'd be curious about the intuition for why.
Very interesting post, Ben.
I do not share your concerns with respect to the centrality of the IID assumption. I'd say that having data that are at least conditionally independent and identically distributed is a prerequisite for any kind of empirical modelling.
Empirics, and hence the inductive reasoning you use when you build a quantitative model to predict something, requires some regularity in the data-generating process (DGP) that you can exploit to produce these forecasts.
In the easiest cases you have "pure" IID-ness; if you need to add "structure" to your modelling, that's where the conditional part comes in.
I get your point about "proving the wrong theorems", but I do not think I share it. In fact, I think it is one of the most interesting selling points of CP. At the same time, I agree with you that some additional work on the size guarantees one gets when using CP would be in order. My feeling (but I really should study a bit more... :) ) is that exploring the intersection between scoring rules à la Gneiting (either univariate or multivariate) and NCMs could be an interesting piece of research.
The part I do share, and find quite fascinating about your post (and, to be fair, did not grasp after reading Vovk's paper), is that apparently some parts of CP are, for lack of a better word, "asymptotic". This hinders (and should be raised as a caveat) the use of CP in my own use cases, where CP is used to produce prediction regions with coverage guarantees using simple, "statistical" models on high-dimensional/complex outputs.
I believe this latter point to be very interesting from a theoretical perspective, but, having seen where part of CP research is going (namely, providing UQ for very complex models in a big-data setting; see e.g. all the work by the Stanford guys), it is probably a bit orthogonal to what current research is doing.
In any case, I'm happy to discuss and share views, or alternative approaches, since it is not easy to find non-parametric methodologies for distributional prediction out there.