Still superficial—and linking to already debunked posts adds nothing. The only credible point concerns sample size for guarantees; the ‘500 points are sufficient’ claim seems limited to a few Berkeley commenters. As the proverb says, ‘The dog barks, but the caravan moves on.’ Notably, Michael Jordan and Emmanuel Candès are strong supporters of conformal prediction.
How are you, my friend, the Predict Addict! I knew this would bring your delightful personality to my webpage. It's been far too long since we've spoken. How is life treating the world's favorite conformal prediction influencer?
Of course, I felt obliged to comment. Great, can't complain, trust you are as well.
Could you please expand on "But the theorems still only hold if your data is sampled from a distribution"? Your data collection mechanism defines the distribution. If it's something like "take the first 10 subjects/objects I see" you might lose exchangeability, but you still have a distribution over those 10. If I have a sampling frame over a defined and stable list and I choose a subset using a "random" method then I get a bunch of other properties, too. (Like an external "population" distribution)
I understand the utility of calling "a sampling frame over a defined and stable list and I choose a subset using a "random" method" a distribution. But who does it help to pretend that available convenience samples come from a distribution?
It can be useful in subsampling/permutation analyses, in that case, as in Fisher's Cowpea example. We may be using the term "distribution" differently. I think of a distribution as a fancy name for describing a collection of values, from which I can examine subsets and compare functions on those sets. Would that be the actuarial approach you alluded to earlier? I've rarely participated in actual "real" sampling and have mostly seen convenience/haphazard sets of data.
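Here's a tiny sketch of the kind of thing I mean, in the spirit of that permutation argument (the numbers are made up for illustration, not the cowpea data):

```python
# A minimal two-group permutation test over a fixed, finite collection of
# values. The measurements below are invented purely for illustration.
import itertools
from statistics import mean

group_a = [11.2, 9.8, 10.5, 12.0]   # hypothetical measurements
group_b = [8.9, 9.1, 10.0, 8.4]

observed_diff = mean(group_a) - mean(group_b)
pooled = group_a + group_b
n_a = len(group_a)

# Enumerate every way of relabeling the pooled values into two groups.
# The "distribution" here is just the collection of relabeled differences.
diffs = []
for idx in itertools.combinations(range(len(pooled)), n_a):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(mean(a) - mean(b))

# Fraction of relabelings with a difference at least as large as observed.
p_value = sum(d >= observed_diff for d in diffs) / len(diffs)
print(f"observed diff = {observed_diff:.2f}, permutation p-value = {p_value:.3f}")
```

The "distribution" in that calculation is nothing more than the 70 relabelings of the eight observed values.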
Your comment weirdly got quadruple-posted. I deleted the copies.
Perhaps my phrasing in the post is taking too much poetic liberty for something that is marginally precise. For me, a distribution is a function that maps events into [0,1]. And when I say data is from a distribution, it means that there is some function like this, which we can describe, and we can describe the sigma algebra, and we can describe what the associated random variables are. Sometimes models like this are useful to describe reality, but usually we skip over all the technical details and never check whether we can specify or verify any of them.
As noted above, I take a bottom-up approach, starting with discrete objects, and defining the distribution from them. Since I'm dealing with finite numbers of objects, and the measurements on them, I think in terms of power sets rather than sigma algebras. Continuous distributions come in as approximations. Frequencies are a sub-type of probabilities.
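To make the bottom-up picture concrete, here's a toy version with made-up measurements, where the event space is literally the power set and probabilities are frequencies:

```python
# A toy bottom-up "distribution": a finite set of objects, events as subsets
# (the power set), probabilities as frequencies. Values are invented.
from itertools import chain, combinations

measurements = {"obj1": 3.2, "obj2": 1.7, "obj3": 3.9, "obj4": 2.4}
objects = list(measurements)

def power_set(items):
    """Every subset of a finite collection -- the event space."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def prob(event):
    """Probability of an event is the fraction of objects belonging to it."""
    return len(event) / len(objects)

# e.g. the event "measurement exceeds 3"
event = [o for o in objects if measurements[o] > 3]
print(prob(event))                          # 0.5
print(sum(1 for _ in power_set(objects)))   # 16 events for 4 objects
```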
"And even when you do have a probability distribution from which you sample, you have to deal with the nebulous nature of the ex ante guarantee. Most conformal prediction guarantees are deeply misleading. They assert that the probability a new observation falls outside of your predicted set is 95%. But that probability is over the new event and the training set. It’s an ex ante guarantee, not an ex post guarantee."
Why is this a problem? When predicting, I care about the error on new observations. If I didn't care about new observations, I wouldn't bother with predicting them, right?
Because the guarantee is not about the future. It's a probability measured with respect to the future AND your training data. "The probability that my training data returns an interval that contains the next sample" is very different from first using your training data to construct an interval I and then computing "the probability that the next sample is in the computed interval."
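Here's a toy simulation of the distinction, using a made-up split-conformal setup with Gaussian scores (nothing from the post, just an illustration):

```python
# Sketch of the ex ante / ex post distinction for split conformal intervals.
# The Gaussian "scores" and sizes below are illustrative, not a real experiment.
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal = 0.05, 50

def conformal_quantile(scores, alpha):
    # Standard split-conformal cutoff: the ceil((n+1)(1-alpha))-th smallest score.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

# Ex ante: average over BOTH the calibration draw and the new point.
hits = []
for _ in range(2000):
    cal = np.abs(rng.normal(size=n_cal))   # fresh calibration scores
    q = conformal_quantile(cal, alpha)
    new = np.abs(rng.normal())             # one new point per calibration set
    hits.append(new <= q)
print("ex ante coverage ~", np.mean(hits))

# Ex post: fix ONE calibration set, then ask how often new points land inside.
cal = np.abs(rng.normal(size=n_cal))
q = conformal_quantile(cal, alpha)
new = np.abs(rng.normal(size=20000))
print("ex post coverage for this particular interval:", np.mean(new <= q))
```

The first loop averages over both the calibration draw and the new point and lands near 95%; the second fixes one calibration set, and the coverage for that particular interval wobbles from run to run.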
I see. That's indeed misleading.
What do we actually mean when we say "there is no distribution?" Does it mean "we don't know the distribution", "the data generating process is not homogeneous across the controlled parameter space (e.g., time)", or something else?
I mean that modeling data as sampled from a probability distribution that is invariant under all permutations is almost never justifiable.
I'm realizing this post needed to be a lot longer to fill in all of the details. But as I wrote, I really should just write up my collection from last year and have a pdf I can point to.
OK, so for example, the i.i.d. assumption isn't justified?
...yeah, I'm most likely wrong