John Venn wrote 'The Logic of Chance' in 1866, in which he observes that statistical regularity increases as you collect more data over time, but that the regularity breaks down after enough time has passed. This was in the context of lifespans and heights.

Added to my reading list!

Forgive me if this sounds ignorant, but in CS281A we had about two lectures (maybe more?) going over generalization error in machine learning. From what I recall (and also how Wikipedia defines generalization error), it is supposed to be "a measure of how accurately an algorithm is able to predict outcome values for previously unseen data". And of course, I remember you did not speak very highly of the conventions that arose in the machine learning community as a result of some of those theories. In this blog post, however, you argue that external validity is more of a philosophical question, resting on beliefs attached to a strong conceptual model, which makes it appear more qualitative than quantitative. So my question, and I'm asking for your opinions/thoughts here, is: what did we learn from generalization error that is useful to take away when we use machine learning algorithms in real-world applications, given that this post seems to handwave that it was never all that useful?

My $.02: mathematically, generalization error rests on the assumption of IID data, which is exactly the "assuming external validity holds" that Ben refers to.

Practically, a lot of work in applied ML can be viewed as attempting to validate some or all of the M-STOUT conditions Ben suggests: collecting training data from as wide a range of settings as possible, normalizing the hardware or features, collecting new data over time and retraining, and so on.
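To make the IID point above concrete, here's a minimal sketch (not from the post; the store/percentage framing is borrowed from the apples example discussed later in this thread). A trivial "predict the majority label" rule has test error matching training error when the test data really are drawn from the same distribution, and the guarantee evaporates under a shift:

```python
# Sketch: a holdout estimate of generalization error is only trustworthy
# when test data come from the same distribution as the training data.
import random

random.seed(0)

def make_labels(n, p_red):
    # Binary labels: 1 ("red") with probability p_red, else 0.
    return [1 if random.random() < p_red else 0 for _ in range(n)]

train = make_labels(10_000, p_red=0.9)         # original setting
iid_test = make_labels(10_000, p_red=0.9)      # same distribution: IID holds
shifted_test = make_labels(10_000, p_red=0.5)  # new context: IID violated

# "Model": always predict the majority label seen in training.
majority = 1 if sum(train) > len(train) / 2 else 0

def error(labels):
    return sum(1 for y in labels if y != majority) / len(labels)

print(f"training error: {error(train):.2f}")        # ~0.10
print(f"IID test error: {error(iid_test):.2f}")     # ~0.10, tracks training
print(f"shifted error:  {error(shifted_test):.2f}") # ~0.50, guarantee breaks
```

The theory bounds the gap between the first two numbers; it says nothing about the third.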

I agree with everything Rif wrote.

I'll add that the most practically useful parts of generalization theory concern questions of *internal* validity like the holdout method. https://www.argmin.net/p/you-got-a-9-to-5-so-ill-take-the
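For what the holdout method looks like in miniature, here is a hedged sketch (toy data and threshold rule are my own illustration, not anything from the linked post): fit on one split, then estimate error on an untouched split drawn from the same source.

```python
# Sketch of the holdout method: split once, fit on one part, and estimate
# prediction error on the part the fitting procedure never touched.
import random

random.seed(1)

# Toy dataset: x in [0, 1), true label = (x > 0.5), with 10% label noise.
data = []
for _ in range(2000):
    x = random.random()
    y = (x > 0.5) != (random.random() < 0.1)
    data.append((x, int(y)))

random.shuffle(data)
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

def err(t, pts):
    # Fraction of points misclassified by the threshold rule "predict x > t".
    return sum(1 for x, y in pts if (x > t) != y) / len(pts)

# "Training": pick the threshold minimizing training error over a grid.
best_t = min((i / 100 for i in range(101)), key=lambda t: err(t, train))

# The holdout error estimates this rule's error on fresh data from the
# same distribution -- an internal-validity check, per the comment above.
print(f"chosen threshold: {best_t:.2f}")
print(f"holdout error estimate: {err(best_t, holdout):.3f}")
```

The estimate should land near the 10% noise floor; nothing here certifies performance on data from a different context.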

I do not understand your problem. "Statistics" does not "argue that rates of some events should be the same in different contexts". Statistical conclusions are valid as long as the assumptions they are based on are valid. As I see it, your apples example should read: based on observations of apples at Berkeley Bowl, statistics tells us that an iid sample of apples from Monterey Market should contain about 90% red ones, assuming that the distribution of colours at Monterey Market is the same as at Berkeley Bowl.
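The commenter's rephrasing can be spelled out in a few lines (a sketch of my own; the 90% figure comes from the apples example, the sample size is arbitrary): everything statistics delivers is conditional on the assumed shared distribution.

```python
# Sketch: IF Monterey Market's colour distribution matches Berkeley Bowl's
# (90% red), THEN an iid sample should be about 90% red, up to binomial
# fluctuations. The "if" is the external-validity assumption itself.
import math
import random

random.seed(2)
p_red = 0.90  # assumed shared distribution across the two stores
n = 400       # size of the iid sample (arbitrary choice)

reds = sum(1 for _ in range(n) if random.random() < p_red)
p_hat = reds / n
stderr = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"sample fraction red: {p_hat:.3f} (+/- {1.96 * stderr:.3f} at 95%)")
# Statistics licenses this interval only under the assumption above; it
# says nothing about whether the two stores' distributions actually match.
```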

Any assumptions are derived from theories about the world. The theories may be correct or not, and to find out which are which, you perform experiments. Where is the problem here? Isn't this just what science is about?