A common viewpoint is that there is a set of all possible data out there (e.g., all images on the web). We collect a subset of these data and train our model. The main (and likely unreasonable) assumption is that the training subset is an iid sample (or uniformly sampled w/o replacement) from the set of all possible data. This assumption is the central ingredient in theoretical generalization bounds. While it's probably not perfectly reasonable, it does give us a framework for comparing models and algorithms, and aligns with common practices like hold-out validation. I think this is a useful model of the (training) data-generating distribution.
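Here's a toy sketch of what I mean, in Python; the "population" and all the numbers are made up, but it shows the two sampling models side by side and why hold-out validation estimates population error under either one:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for "all possible data": a fixed, finite population of points.
population = rng.normal(size=(10_000, 5))
labels = (population @ rng.normal(size=5) > 0).astype(int)

n = 500

# Model 1: iid sampling (uniform WITH replacement).
iid_idx = rng.integers(0, len(population), size=n)

# Model 2: uniform sampling WITHOUT replacement.
wor_idx = rng.choice(len(population), size=n, replace=False)

# Hold-out validation under model 2: fit on one part, estimate on the rest.
train_idx, test_idx = wor_idx[:400], wor_idx[400:]
X_tr, y_tr = population[train_idx], labels[train_idx]
w = np.linalg.lstsq(X_tr, 2 * y_tr - 1, rcond=None)[0]  # least-squares classifier

def predict(X):
    return (X @ w > 0).astype(int)

holdout_err = np.mean(predict(population[test_idx]) != labels[test_idx])
pop_err = np.mean(predict(population) != labels)
print(f"hold-out error {holdout_err:.3f} vs population error {pop_err:.3f}")
```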
It's a perfectly fine model, but people start there and then casually slip into "assume x_i are sampled iid from D." I have found that being constantly explicit about what the model is and how data is deliberately, intentionally generated is helpful.
Also, it's helpful to teach that most of the generalization bounds *immediately* apply to the w/o replacement setting. Working through the multi-armed bandit assuming data was sampled w/o replacement was illuminating for me.
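A quick simulation (not a proof!) of the flavor of this, with an arbitrary made-up reward pool and sample sizes: draw from a fixed finite pool both iid and without replacement; the without-replacement mean concentrates at least as tightly, which is why bounds like Hoeffding's carry over.

```python
import numpy as np

rng = np.random.default_rng(1)
pool = rng.binomial(1, 0.3, size=2_000).astype(float)  # finite reward pool
mu, n, trials = pool.mean(), 200, 5_000

# iid draws (with replacement) vs uniform draws without replacement.
dev_iid = np.abs(rng.choice(pool, size=(trials, n), replace=True).mean(axis=1) - mu)
dev_wor = np.abs(np.array([
    rng.choice(pool, size=n, replace=False).mean() for _ in range(trials)
]) - mu)

# Sampling without replacement concentrates at least as tightly as iid.
print(f"P(|mean - mu| > 0.05): iid {np.mean(dev_iid > 0.05):.3f}, "
      f"w/o replacement {np.mean(dev_wor > 0.05):.3f}")
```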
I agree. Deliberately, intentionally generated and selected data can be very useful. If you just view data as giving you a bunch of (soft) constraints, then it's a bit clearer what is going on with training your model "on the data".
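For instance, here's a toy version of the soft-constraint view (made-up data, nothing fancy): treat each example as the constraint y_i * <w, x_i> >= 1, penalize violations with a hinge, and "training on the data" becomes minimizing total constraint violation -- which is just a linear SVM.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))

w, lam, lr = np.zeros(3), 0.01, 0.1
for _ in range(500):
    margins = y * (X @ w)
    violated = margins < 1  # which soft constraints are currently violated
    # Subgradient of the average hinge penalty plus a ridge term.
    grad = lam * w - (y[violated, None] * X[violated]).sum(axis=0) / len(X)
    w -= lr * grad

print("fraction of constraints satisfied:", np.mean(y * (X @ w) >= 1))
```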
I guess it is not clear whether the abstraction of the set of all images makes sense at all, even aside from any probability distribution on top of it. The process which generates images on the web is impossibly complicated. Yet, somehow, it generates images simple enough so that they can often be reliably identified.
I agree, but the set of all images on the web isn't an abstraction. And a randomly selected subsample of that set does induce a distribution.
although this set is dynamically growing, which is another complication
Right, the set is dynamically growing quickly (or even shrinking sometimes) and, depending on how you access it, you may have very different representations of it. Not clear that thinking of it as some sort of abstract fixed set is best.
To be fair, it is not like I have a viable alternative.
Though imagining such stochastic processes helps you build useful simulators of data (a toy sketch of what I mean follows below).
Yes, 100% agree.
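Here's the kind of thing I had in mind -- a toy simulator where the 8x8 "templates" and the noise level are entirely invented, but the imagined process is something you can actually sample from and test methods against:

```python
import numpy as np

rng = np.random.default_rng(3)
templates = {
    "circle": lambda g: (((g[0] - 3.5) ** 2 + (g[1] - 3.5) ** 2) < 8).astype(float),
    "stripe": lambda g: (g[1] % 2 == 0).astype(float),
}

def sample_image():
    """Sample (label, image) from the imagined generative process."""
    label = rng.choice(list(templates))          # pick a class
    grid = np.indices((8, 8))                    # row/column index grids
    image = templates[label](grid) + 0.3 * rng.normal(size=(8, 8))
    return label, image

label, img = sample_image()
print(label, img.shape)
```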
Does this also mean the population does not exist?
The population might exist! At least I can create practical examples where it does. But you are right that the population also *might* not exist.
Can I watch these lectures online?
> This semester was the first time I taught machine learning and never felt like I was lying.
We all aspire to that!
The mythological data-generating distribution is a parameter. It does not exist in the sense that you cannot afford to fix it in general. Indeed, it usually stands for ignorance rather than non-determinism, though the latter is but a form of the former. You can think of it as an imprecise probability distribution.
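A minimal numerical illustration of what I mean by that: keep a *set* of candidate distributions (the three rows below are arbitrary stand-ins for ignorance) and report lower and upper expectations rather than a single number.

```python
import numpy as np

outcomes = np.array([0.0, 1.0, 5.0])  # payoffs of three possible outcomes
credal_set = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
    [0.7, 0.2, 0.1],
])                                    # candidate distributions, each row sums to 1

expectations = credal_set @ outcomes
print(f"lower expectation {expectations.min():.2f}, "
      f"upper expectation {expectations.max():.2f}")
```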
Hi Ben, you might find this preprint of interest https://arxiv.org/abs/2407.17395 Perhaps you are making similar points.
It is curious that machine learning folks, who often like to say they are "data driven," pay little attention to the data at a conceptual level. I think the issue is that the data is taken to be _given_ (as you allude to -- someone "gives you" a bunch of data). The word "data" derives from the Latin dare, meaning "to give." We would be better served to think of capta.... but collecting data is not considered as sexy as making complex models.
One argument (that is simply wrong) is that the existence of distributions is justified by the "law" of large numbers. It is not. For an argument regarding this, and an alternative (to distributions), you might find this other paper of interest: https://www.sciencedirect.com/science/article/pii/S0888613X24000355 It shows what you get when you don't assume the "law" of large numbers holds (i.e., that relative frequencies converge). That the result is something other folks had studied for some time is pretty interesting. That coherent upper previsions arise elsewhere in ML (in fairness, DRO, and even in SVMs!) suggests they are not so weird after all.... (A toy illustration of non-convergent relative frequencies follows below.)
Regards, Bob Williamson
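P.S. The promised illustration -- a deterministic 0/1 sequence I made up for this purpose, not taken from the paper: blocks of 1s and 0s of exponentially growing length make the running relative frequency of 1s oscillate between roughly 1/3 and 2/3 forever, so no single limiting frequency (and hence no single distribution) is pinned down; the upper and lower limits play the role of upper and lower probabilities.

```python
import numpy as np

# Alternate blocks of 1s and 0s with lengths 1, 2, 4, 8, ...
bits, val, k = [], 1, 0
while len(bits) < 2**20:
    bits.extend([val] * 2**k)
    val, k = 1 - val, k + 1

# Running relative frequency of 1s; it never settles down.
freq = np.cumsum(bits) / np.arange(1, len(bits) + 1)
print(f"running frequency ranges over [{freq[1000:].min():.3f}, "
      f"{freq[1000:].max():.3f}] -- no single limit")
```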
This feels kinda like the ML-vs-stats semantic battles of the 2000s. I don't see how formulating a set of MRIs as a sample from a population is necessarily better or worse than as examples from a stochastic generating process. Both encourage different but valuable intuitions and bias you towards using different but potentially valuable tools and formulations.
Rejecting the i.i.d. super population assumption? Terminally based.
I gave the last lecture of my causal inference course on "Foundations, Limitations, and Controversies" and basically ripped into the notion of a super-population, which had so far gone unquestioned in the lectures. I had fun and I think the students enjoyed it too.
Keep on fighting the good (design based) fight ;)
Awesome article as always, Professor! In general, is there a reason we cannot just define the data-generating distribution as the distribution of samples from our population? It seems like we still need to assume the existence of some underlying group that we are drawing samples from (regardless of whether this group is the data-generating distribution or the population).