6 Comments

FYI, the deterministic identity you derived is a special case of Theorem 1 in https://arxiv.org/pdf/1112.1390: take the one-dimensional case with constant covariates equal to 1 and let "a" go to 0. So a similar equality holds more generally when you use online ridge regression. The very same equality between marginal likelihood and predictive distributions also holds in Gaussian processes.
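
For anyone who wants to poke at this numerically, here is a minimal sketch of the scalar-covariate version of that equality as I understand it: the sum of squared online ridge errors, each divided by 1 + x_t^2 / A_{t-1}, should equal the minimum of the batch ridge objective a*w^2 + sum_t (y_t - w*x_t)^2. The function names are made up for illustration and the code comes from neither the paper nor the post. With x_t = 1 and a close to 0, the online predictor becomes the running mean and the batch side becomes the sum of squared deviations from the overall mean, which is presumably the special case referred to above.

```python
def online_ridge_weighted_loss(xs, ys, a):
    """Weighted sum of squared online ridge prediction errors (scalar covariates)."""
    A = a        # A_{t-1} = a + sum_{s<t} x_s^2
    b = 0.0      # sum_{s<t} x_s * y_s
    total = 0.0
    for x, y in zip(xs, ys):
        pred = (b / A) * x                       # ridge prediction from past data only
        total += (y - pred) ** 2 / (1.0 + x * x / A)
        A += x * x
        b += x * y
    return total


def batch_ridge_objective(xs, ys, a):
    """Minimum over w of a*w^2 + sum_t (y_t - w*x_t)^2, using the closed-form w."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w = sxy / (a + sxx)
    return a * w * w + sum((y - w * x) ** 2 for x, y in zip(xs, ys))


# General scalar covariates with a fixed regularizer (arbitrary made-up data).
xs = [0.5, -1.0, 2.0, 1.5]
ys = [1.0, 0.0, 1.0, 1.0]
print(online_ridge_weighted_loss(xs, ys, 0.7), batch_ridge_objective(xs, ys, 0.7))

# Bits with constant covariate 1 and a tiny regularizer standing in for a -> 0.
bits = [1.0, 0.0, 1.0, 1.0, 0.0]
ones = [1.0] * len(bits)
mean = sum(bits) / len(bits)
print(online_ridge_weighted_loss(ones, bits, 1e-9), sum((y - mean) ** 2 for y in bits))
```

Each pair of printed numbers should agree up to floating-point error (and, in the second pair, up to the small bias from using a tiny positive a instead of the exact limit).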

Thanks! I'm sure there are many more of these identities lurking about. If you see other similar equalities, please send them my way.

It sort of makes sense to the EE side of my brain: these identities are saying that there are causal filters with the same prediction accuracy as acausal filters.

Great post and cool identity.

The awkward bit is explaining "(a) the rules stated that you had to make the same prediction for all of these bits" when you actually employ an online rule whose predictions evolve.

I personally find the assumption of exchangeability to be a pretty intuitive match for "all missing outcomes are indistinguishable from each other -- both a priori and conditioned on anything you've seen so far." Online learning papers tend to act as if this is true, but shy away from stating it outright.

I agree with your first point, but my explanation is just that, "averaged over time," the average of the first t bits is a good predictor of the next bit. Is that a deterministic statement about ergodicity? Don't get me wrong: I'm still confused.

With regard to exchangeability, I agree that it assumes less than iid, but it still assumes a hypothetical probability distribution. I'm trying to argue that this is unnecessary. Not only do I not like statistical models, but I worry about using the principle of indifference as a guide towards building said statistical models.

I'll bet the same identity holds even under the weaker assumption of exchangeability, which I think is a little less divorced from reality than iid. Although now that I think of it, maybe I'm splitting hairs: "bits drawn iid from a distribution you don't know and wish to learn and for whose p you perhaps have some prior distribution" basically IS the same thing as exchangeable, by de Finetti.

Yes, for exchangeable, you get the same identity as iid. But I don't think exchangeable is any less implausible than iid. It's aesthetics. As de Finetti forcefully told us, probability does not exist.
