
Bob Williamson

Hi Ben, you might find this preprint of interest: https://arxiv.org/abs/2407.17395. Perhaps you are making similar points.

It is curious that machine learning folks, who often like to say they are "data driven", pay little attention to the data at a conceptual level. I think the issue is that the data is taken to be _given_ (as you allude -- someone "gives you" a bunch of data). The word "data" derives from the Latin verb dare, meaning "to give". We would be better served to think of capta... but collecting data is not considered as sexy as making complex models.

One argument (which is simply wrong) is that the existence of distributions is justified by the "law" of large numbers. It is not. For an argument regarding this, and an alternative to distributions, you might find this other paper of interest: https://www.sciencedirect.com/science/article/pii/S0888613X24000355. It shows what you get when you don't assume the "law" of large numbers holds (i.e. that relative frequencies converge). That the result is something other folks had studied for some time is pretty interesting. That coherent upper previsions arise elsewhere in ML (in fairness, in DRO, and even in SVMs!) suggests they are not so weird after all...

Regards, Bob Williamson

Prateek Garg

Though imagining such stochastic processes helps you build useful simulators of data.
