Hi Ben, you might find this preprint of interest: https://arxiv.org/abs/2407.17395 Perhaps you are making similar points.
It is curious that machine learning folks, who often like to say they are "data driven," pay little attention to the data at a conceptual level. I think the issue is that the data is taken to be _given_ (as you allude -- someone "gives you" a bunch of data). The word "data" derives from the Latin word dare, meaning to give. We would be better served to think of capta... but collecting data is not considered as sexy as making complex models.
One argument (which is simply wrong) is that the existence of distributions is justified by the "law" of large numbers. It is not. For an argument regarding this, and an alternative (to distributions), you might find this other paper of interest: https://www.sciencedirect.com/science/article/pii/S0888613X24000355 It shows what you get when you don't assume the "law" of large numbers holds (i.e., that relative frequencies converge). That the result is something other folks had studied for some time is pretty interesting. That coherent upper previsions arise elsewhere in ML (in fairness, DRO, and even in SVMs!) suggests they are not so weird after all...
Regards, Bob Williamson
I wish I had known about your preprint at the beginning of the semester. I would have assigned it as reading! I think we are in agreement about almost all of these points.
I also agree with you that it is wrong to justify distributions by the law of large numbers. Theory always works the other way around: we make whatever assumptions we can to ensure the LLN is true. I've noted on the blog a few times how these assumptions fall on their face when they are used to make predictions about practice (Notably: https://arxiv.org/abs/1902.10811).
I'm going to need some time to digest the imprecise probability paper, but I am intrigued and will do my best to work through it. Thank you for the pointer.
Thanks. The basic idea of the imprecise probability paper is simple (and not original to us!) --- when sequences of relative frequencies do not converge they still have a set of cluster points. Work with that set instead. The historically interesting point is that Ivanenko (on whose work we very heavily relied) worked this out long ago, but nobody noticed (we found his book very tough going).
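The non-convergence is easy to see concretely. Here is a toy construction (my own illustration, not taken from the paper): a binary sequence built from blocks of doubling length, whose running relative frequency never settles down but instead oscillates between two cluster points.

```python
# A binary sequence whose relative frequencies do not converge:
# alternate blocks of 1s and 0s with lengths 1, 2, 4, 8, ...
# The running mean oscillates forever; its cluster points are
# {1/3, 2/3} rather than a single limiting frequency.

def oscillating_bits(num_blocks):
    bits = []
    for k in range(num_blocks):
        bits.extend([1 - (k % 2)] * 2**k)  # block k: all 1s if k even, else 0s
    return bits

bits = oscillating_bits(22)

# Running relative frequency of 1s after each prefix.
running_mean = []
ones = 0
for i, b in enumerate(bits, start=1):
    ones += b
    running_mean.append(ones / i)

# 0-based index of the last bit of block k is 2**(k+1) - 2.
ends = [2**(k + 1) - 2 for k in range(1, 22)]

# Frequencies at block boundaries alternate, approaching 2/3
# (after a block of 1s) and 1/3 (after a block of 0s).
print([round(running_mean[e], 3) for e in ends[-6:]])
```

The set of cluster points here is just {1/3, 2/3}; the paper's proposal, as I read it, is to work with that set where one would ordinarily posit a single limiting distribution.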
I am a big fan of your ImageNet-does-not-generalise-to-ImageNet paper!
Finally, you also might be interested in the lectures in my course from a few years ago on the presumption of the existence of a distribution. Whole course: https://www.youtube.com/playlist?list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ
Lecture on distributions: https://www.youtube.com/watch?v=7JS0Mo8xJI4&list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ&index=9
I talk about insurance (which I know you see as relevant)
Lecture on choices of categories / labels: https://www.youtube.com/watch?v=JvFRTAGpKeY&list=PL05umP7R6ij2mpH-oHzBWlz8OCApvrawJ&index=10&pp=iAQB
Hi Bob, thank you for all these references (especially for pointing out Ivanenko's book, I was not aware of his work).
Though imagining such stochastic processes helps you build useful simulators of data.
Yes, 100% agree.
A common viewpoint is that there is a set of all possible data out there (e.g., all images on the web). We collect a subset of these data and train our model. The main (and likely unreasonable) assumption is that the training subset is an iid sample (or uniformly sampled w/o replacement) from the set of all possible data. This assumption is the central ingredient in theoretical generalization bounds. While it's probably not perfectly reasonable, it does give us a framework for comparing models and algorithms, and aligns with common practices like hold-out validation. I think this is a useful model of the (training) data-generating distribution.
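For what it's worth, this finite-population picture is easy to make concrete. A small sketch (with random numbers standing in for "all possible data"; every name and constant here is made up for illustration) showing that both an i.i.d. sample and a uniform sample without replacement give hold-out-style estimates that track the population average:

```python
# Toy version of the finite-population viewpoint: a fixed "set of
# all possible data" (numbers standing in for images), from which
# the training set is drawn either i.i.d. (with replacement) or
# uniformly without replacement.
import random

random.seed(0)
population = [random.gauss(0, 1) for _ in range(100_000)]  # "all data out there"
pop_mean = sum(population) / len(population)

n = 1_000
iid_sample = random.choices(population, k=n)  # with replacement -> i.i.d.
wor_sample = random.sample(population, k=n)   # uniform, without replacement

est_iid = sum(iid_sample) / n
est_wor = sum(wor_sample) / n

# Both estimates are close to the population mean -- the sense in
# which this model underwrites hold-out validation.
print(abs(est_iid - pop_mean), abs(est_wor - pop_mean))
```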
It's a perfectly fine model, but people start there and then casually slip into "assume xi are sampled iid from D." I have found that being constantly explicit about what the model is and how data is deliberately, intentionally generated is helpful.
Also, it's helpful to teach that most of the generalization bounds *immediately* apply to the w/o replacement setting. Working through the multiarmed bandit assuming data was sampled w/o replacement was elucidating for me.
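The intuition for why the bounds carry over can be checked by simulation (a rough sketch under arbitrary made-up sizes, not a proof): sample means drawn without replacement from a finite population concentrate at least as tightly as i.i.d. means, shrinking roughly by the finite-population correction factor.

```python
# Compare the spread of sample means: i.i.d. (with replacement)
# vs. uniform sampling without replacement from a finite population.
import random

random.seed(1)
population = [random.random() for _ in range(10_000)]
n, trials = 2_000, 1_000

def spread(sampler):
    """Standard deviation of the sample mean over repeated draws."""
    means = [sum(sampler()) / n for _ in range(trials)]
    mu = sum(means) / trials
    return (sum((m - mu) ** 2 for m in means) / trials) ** 0.5

sd_iid = spread(lambda: random.choices(population, k=n))
sd_wor = spread(lambda: random.sample(population, k=n))

# Without replacement is the more concentrated of the two (by
# roughly the correction factor sqrt((N - n) / (N - 1))).
print(sd_iid, sd_wor)
```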
I agree. Deliberately, intentionally generated and selected data can be very useful. If you just view data as giving you a bunch of (soft) constraints, then it's a bit clearer what is going on with training your model "on the data".
I guess it is not clear whether the abstraction of the set of all images makes sense at all, even aside from any probability distribution on top of it. The process which generates images on the web is impossibly complicated. Yet, somehow, it generates images simple enough so that they can often be reliably identified.
I agree, but the set of all images on the web isn't an abstraction. And a randomly selected subsample of that set does induce a distribution.
although this set is dynamically growing, which is another complication
Right, the set is dynamically growing quickly (or even shrinking sometimes) and, depending on how you access it, you may have very different representations of it. Not clear that thinking of it as some sort of abstract fixed set is best.
To be fair, it is not like I have a viable alternative.
Does this also mean the population does not exist?
The population might exist! At least I can create practical examples where it does. But you are right that the population also *might* not exist.
Can I watch these lectures online?
This feels kinda like the ML-vs-stats semantic battles of the 2000s. I don't see how formulating a set of MRIs as a sample from a population is necessarily better or worse than as examples from a stochastic generating process. Both encourage different but valuable intuitions and bias us towards using different but potentially valuable tools and formulations.
Rejecting the i.i.d. super population assumption? Terminally based.
I gave the last lecture of my causal inference course on "Foundations, Limitations, and Controversies" and basically ripped into the notion of a super-population, which had so far gone unquestioned in the lectures. I had fun and I think the students enjoyed it too.
Keep on fighting the good (design based) fight ;)
> This semester was the first time I taught machine learning and never felt like I was lying.
We all aspire to that!
The mythological data-generating distribution is a parameter. It does not exist in the sense that you cannot afford to fix it in general. Indeed, it usually stands for ignorance rather than non-determinism, though the latter is but a form of the former. You can think of it as an imprecise probability distribution.
Awesome article as always, Professor! In general, is there a reason we cannot just define the data-generating distribution as the distribution of samples from our population? It seems like we still need to assume the existence of some underlying group that we are drawing samples from (regardless of whether this group is the data-generating distribution or the population).