19 Comments
Rob Nowak:

A common viewpoint is that there is a set of all possible data out there (e.g., all images on the web). We collect a subset of these data and train our model. The main (and likely unreasonable) assumption is that the training subset is an iid sample (or uniformly sampled w/o replacement) from the set of all possible data. This assumption is the central ingredient in theoretical generalization bounds. While it's probably not perfectly reasonable, it does give us a framework for comparing models and algorithms, and aligns with common practices like hold-out validation. I think this is a useful model of the (training) data-generating distribution.
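
A minimal sketch of the contrast between the two sampling models (the finite "set of all possible data" and all sizes below are illustrative):

```python
import random

# Toy stand-in for "the set of all possible data" (e.g., all images on the web).
population = list(range(100_000))

n = 1_000
iid_sample = random.choices(population, k=n)  # i.i.d. = uniform *with* replacement
wor_sample = random.sample(population, k=n)   # uniform *without* replacement

# When n is much smaller than the population, the two samples are nearly
# indistinguishable statistically; counting duplicates shows the only difference.
print(n - len(set(iid_sample)), n - len(set(wor_sample)))
```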

Ben Recht:

It's a perfectly fine model, but people start there and then casually slip into "assume the x_i are sampled i.i.d. from D." I have found it helpful to be constantly explicit about what the model is and how data is deliberately, intentionally generated.

Also, it's helpful to teach that most of the generalization bounds *immediately* apply to the w/o replacement setting. Working through the multi-armed bandit assuming data was sampled w/o replacement was illuminating for me.
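
A quick simulation of that point (the bounded losses and sample sizes below are made up for illustration): the mean of a without-replacement sample concentrates at least as tightly as the i.i.d. one, which is why Hoeffding-style bounds carry over immediately.

```python
import random
import statistics

# A fixed, finite population of bounded "losses" in [0, 1].
population = [random.random() for _ in range(10_000)]
mu = statistics.mean(population)

n, trials = 100, 2_000
dev_iid = [abs(statistics.mean(random.choices(population, k=n)) - mu)
           for _ in range(trials)]
dev_wor = [abs(statistics.mean(random.sample(population, k=n)) - mu)
           for _ in range(trials)]

# Hoeffding's 1963 paper covers both schemes: the without-replacement mean
# deviates no more, on average, than the i.i.d. mean.
print(statistics.mean(dev_iid), statistics.mean(dev_wor))
```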

Rob Nowak:

I agree. Deliberately, intentionally generated and selected data can be very useful. If you just view data as giving you a bunch of (soft) constraints, then it's a bit clearer what is going on when you train your model "on the data".

Misha Belkin:

I guess it is not clear whether the abstraction of the set of all images makes sense at all, even aside from any probability distribution on top of it. The process which generates images on the web is impossibly complicated. Yet, somehow, it generates images simple enough so that they can often be reliably identified.

Rob Nowak:

I agree, but the set of all images on the web isn't an abstraction. And a randomly selected subsample of that set does induce a distribution.
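
Concretely (with a toy stand-in for the set), uniform selection from a fixed finite set is exactly i.i.d. sampling from the uniform distribution that the set induces:

```python
import random
from collections import Counter

web_images = ["cat", "dog", "car", "tree"]  # toy stand-in for the fixed set

# Each draw is an i.i.d. sample from the induced uniform distribution.
draws = [random.choice(web_images) for _ in range(100_000)]
print(Counter(draws))  # frequencies approach 1/4 each
```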

Rob Nowak:

Although this set is dynamically growing, which is another complication.

Misha Belkin:

Right, the set is growing quickly (or even shrinking sometimes), and, depending on how you access it, you may get very different representations of it. It's not clear that thinking of it as some sort of abstract fixed set is best.

Misha Belkin:

To be fair, it is not like I have a viable alternative.

Prateek Garg:

Though imagining such stochastic processes helps you build useful simulators of data.
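
For instance, a minimal sketch (the process below is entirely made up for illustration) of a stochastic process acting as a simulator of labeled data:

```python
import random

def draw_example():
    """Toy stochastic process: latent class -> noisy observation -> labeled datum."""
    label = random.choice([0, 1])       # latent class
    x = label + random.gauss(0.0, 0.5)  # noisy one-dimensional feature
    return x, label

dataset = [draw_example() for _ in range(5)]
print(dataset)
```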

Ben Recht:

Yes, 100% agree.

Nico Formanek:

Does this also mean the population does not exist?

Ben Recht:

The population might exist! At least I can create practical examples where it does. But you are right that the population also *might* not exist.

Yaroslav Bulatov:

Can I watch these lectures online?

Misha Belkin:

> This semester was the first time I taught machine learning and never felt like I was lying.

We all aspire to that!

Matteo Capucci:

The mythological data-generating distribution is a parameter. It does not exist in the sense that you cannot afford to fix it in general. Indeed, it usually stands for ignorance rather than non-determinism, though the latter is but a form of the former. You can think of it as an imprecise probability distribution.

Bob Williamson:

Hi Ben, you might find this preprint of interest: https://arxiv.org/abs/2407.17395. Perhaps you are making similar points.

It is curious that machine learning folks, who often like to say they are "data driven," pay little attention to the data at a conceptual level. I think the issue is that the data is taken to be _given_ (as you allude: someone "gives you" a bunch of data). The word "data" derives from the Latin word "dare," meaning "to give." We would be better served to think of capta... but collecting data is not considered as sexy as making complex models.

One argument (that is simply wrong) is that the existence of distributions is justified by the "law" of large numbers. It is not. For an argument regarding this, and an alternative (to distributions), you might find this other paper of interest: https://www.sciencedirect.com/science/article/pii/S0888613X24000355. It shows what you get when you don't assume the "law" of large numbers holds (i.e., that relative frequencies converge). That the result is something other folks had studied for some time is pretty interesting. That coherent upper previsions arise elsewhere in ML (in fairness, DRO, and even in SVMs!) suggests they are not so weird after all...

Regards, Bob Williamson

Matt:

This feels kinda like the ML-vs.-stats semantic battles of the 2000s. I don't see how formulating a set of MRIs as a sample from a population is necessarily better or worse than as examples from a stochastic generating process. Both encourage different but valuable intuitions, and each biases you toward different but potentially valuable tools and formulations.

Christopher Harshaw:

Rejecting the i.i.d. super-population assumption? Terminally based.

I gave the last lecture of my causal inference course on "Foundations, Limitations, and Controversies" and basically ripped into the notion of a super-population, which had so far gone unquestioned in the lectures. I had fun and I think the students enjoyed it too.

Keep on fighting the good (design-based) fight ;)

Aman Desai:

Awesome article as always, Professor! In general, is there a reason we cannot just define the data-generating distribution as the distribution of samples from our population? It seems like we still need to assume the existence of some underlying group that we are drawing samples from (regardless of whether this group is the data-generating distribution or the population).
