There is no data-generating distribution

Dec 9, 2025

Reflecting on teaching machine learning again. Again.

22 Comments

Hi Ben, you might find this preprint of interest https://arxiv.org/abs/2407.17395 Perhaps you are making similar points.

It is curious that machine learning folks, who often like to say they are "data driven" pay little attention to the data at a conceptual level. I think the issue is that the data is taken to be _given_ (as you allude -- someone "gives you" a bunch of data). The word "data" derives from the latin word (dare) meaning to give. We would be better served to think of capta.... but collecting data is not considered as sexy as making complex models.

One argument (that simply is wrong) is that the existence of distributions is justified by the "law" of large numbers. It is not. For an argument regarding this, and an alternative (to distributions), you might find this other paper of interest. https://www.sciencedirect.com/science/article/pii/S0888613X24000355 It shows what you get when you don't assume the "law" of large numbers holds (i.e. that relative frequencies converge). That the result is something other folks had studied for some time is pretty interesting. That coherent upper previsions arise elsewhere in ML (in fairness, DRO and even in SVMs!) suggests they are not so weird after all....

Regards, Bob Williamson

Reply (1)

Ben Recht

Dec 12

I wish I had known about your preprint at the beginning of the semester. I would have assigned it as reading! I think we are in agreement about almost all of these points.

I also agree with you that it is wrong to justify distributions by the law of large numbers. Theory always works the other way around: we make whatever assumptions we can to ensure the LLN is true. I've noted on the blog a few times how these assumptions fall on their face when they are used to make predictions about practice (Notably: https://arxiv.org/abs/1902.10811).

I'm going to need some time to digest the imprecise probability paper, but I am intrigued and will do my best to work through it. Thank you for the pointer.

Reply (1)

Bob Williamson

Dec 12

Thanks. The basic idea of the imprecise probability paper is simple (and not original to us!) --- when sequences of relative frequencies do not converge they still have a set of cluster points. Work with that set instead. The historically interesting point is that Ivanenko (on whose work we very heavily relied) worked this out long ago, but nobody noticed (we found his book very tough going).

I am a big fan of your image net does not generalise to image net paper!

Finally, you also might be interested in the lectures in my course from a few years ago on the presumption of the existence of a distribution. Whole course: https://www.youtube.com/playlist?list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ

Lecture on distributions: https://www.youtube.com/watch?v=7JS0Mo8xJI4&list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ&index=9

I talk about insurance (which I know you see is relevant)

Lecture on choices of categories / labels: https://www.youtube.com/watch?v=JvFRTAGpKeY&list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ&index=10&pp=iAQB

Reply (1)

Maxim Raginsky

Dec 14

Hi Bob, thank you for all these references (especially for pointing out Ivanenko's book, I was not aware of his work).

arg min

There is no data-generating distribution