There is no data-generating distribution
Reflecting on teaching machine learning again. Again.
This is a live blog of the final lecture of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A full Table of Contents is here. I tried to summarize my semester reflections in class on Thursday, but found my thoughts haven’t quite settled yet. I’m hoping a week of posting will help me sort through it.
This semester was the first time I taught machine learning and never felt like I was lying. Over the past decade or so, I’ve been working to strip from the curriculum field-making myths like overfitting and bias-variance tradeoffs. As I dug into the roots of the misunderstandings and misinterpretations of machine learning, I kept running up against the unassailable belief in the data-generating distribution. There is no data-generating distribution. This semester, I managed to remove that too.
Machine learning is all about recognizing patterns in collected data and transforming these patterns into predictions and actions. This transformation requires a model for how the data is linked to the outcomes we aim to forecast or act upon. We have a sample that we believe is representative of a population. The model declares, implicitly or explicitly, what “representative” means. The most ubiquitous model in machine learning posits that datasets consist of samples drawn from a probability distribution. This is the data-generating distribution. It doesn’t exist.
Just think about it. What is the stochastic process that creates the radiology images in a cancer detection dataset? What is the stochastic process that generates click-through data for a machine learning engineer building a recommendation system? What is the stochastic process that results in the millions of books literally torn to pieces by Anthropic for the pretraining of their language model?
There is no data-generating distribution. And yet, machine learning theorists and practitioners love to talk about it. Machine learning theorists are at least upset about it. They love to make arguments that are “distribution-free,” but then they always have a data-generating distribution hiding in the background. This semester, I managed to stitch together a truly distribution-free story. Randomness can be created and applied algorithmically, but nature need not be modeled as god playing dice. All of the randomness is made or imagined by the machine learning engineer.
To set the stage, we first need to think about populations, the imagined examples you want to make predictions about or take actions upon. The central tenet of this class is that our model of the population dictates what we see in our samples. Once your conception of the population is set, what you do with samples is determined. Most of the mucking around with data targets action at the population level, so think there first.
Indeed, the first step in any forecasting or decision-making problem is to discuss how predictions will be scored at the population level. Once you conceive of a scoring system, you can compare the value of different prediction methods. The scoring determines the best predictions.
As an example of how this population-level thinking works, say that you have to decide whether to give an entire population of people a drug or not. You know in advance how everyone responds with and without the drug. Your central planner tells you how many quality-adjusted life years are needed to keep public opinion high. You plug in the two options and decide whether to administer the drug based on which one scores higher on your epidemiological metric.
This is more or less how decision theory works. The optimal prediction and decision are completely determined by the score function you choose. I call this metrical determinism. Different cost-benefit assessments of overdiagnosis will yield different policy recommendations for cancer screening. Different philosophies about the demographics of an incoming class will yield different admission rules. Different risk tolerances determine different investment strategies.
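To make the determinism concrete, here is a toy sketch in Python. All of the QALY numbers and the two blanket policies are made up; the point is only that once the score is fixed, the decision is a lookup.

```python
# Hypothetical QALY responses for a tiny population; the planner knows
# both outcomes for every individual in advance.
population = [
    {"qaly_with_drug": 8.1, "qaly_without_drug": 7.4},
    {"qaly_with_drug": 6.9, "qaly_without_drug": 7.2},
    {"qaly_with_drug": 9.0, "qaly_without_drug": 8.3},
]

def score(policy):
    """Total QALYs over the population under a blanket policy."""
    key = "qaly_with_drug" if policy == "treat everyone" else "qaly_without_drug"
    return sum(person[key] for person in population)

# Metrical determinism: the chosen score completely determines the decision.
best_policy = max(["treat everyone", "treat no one"], key=score)
print(best_policy, score("treat everyone"), score("treat no one"))
```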
The population model determines the consequences of optimization-based decision making. When your metric is an average, your optimal prediction is necessarily statistical. Maximizing average benefit requires statistical rules. Trade-offs between error types and Neyman-Pearson decision theory all derive from properties of the population. All of the results concerning the impossibility, incompatibility, and incoherence of fair machine learning are population-level arguments. None of these deductions require a data-generating distribution.
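Here is a similarly tiny illustration of why an average metric forces a statistical rule. With a hypothetical finite population in hand, minimizing the number of errors (the average 0-1 loss) means predicting the majority outcome within each group, a rule read straight off population counts with no probability model anywhere.

```python
from collections import Counter

# A hypothetical finite population of (feature, outcome) records.
population = [("smoker", 1), ("smoker", 1), ("smoker", 0),
              ("nonsmoker", 0), ("nonsmoker", 0), ("nonsmoker", 1)]

# Tally outcomes within each feature group.
groups = {}
for x, y in population:
    groups.setdefault(x, Counter())[y] += 1

# Predicting the most common outcome per group minimizes the total number
# of errors over the population, i.e., the average 0-1 loss.
rule = {x: counts.most_common(1)[0][0] for x, counts in groups.items()}
print(rule)  # {'smoker': 1, 'nonsmoker': 0}
```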
It is critical to be clear about what we think the population is. We make decisions at the population level based on sample-level evidence, and those sample-level decisions try to approximate the ideal population-level decisions given the statistical facts at hand.
To make such sample-level approximations, we need to make assumptions about how the sample is linked to the population. There are many such models in machine learning, each defined by how the data is collected and how predictions are evaluated.
Batch Learning: Someone hands you a sample of data. You can do whatever you want with it. You are evaluated by how well your model does on the remainder of the population. To justify sample-level conclusions, we typically choose a model so that the law of large numbers holds. This could mean that the data are iid samples from the population. It could mean that the data are a subset actively sampled at random from the population (these two are very different assumptions). The model could also be deterministic, with assumptions about how observations and predictions are linked; people like to assume linearity. Some combination of modeling assumptions gives us confidence that data from the past will be representative of observations in the future.
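A minimal sketch of the batch setup, on synthetic numbers: the population is a fixed finite array, the only randomness is the training subset the engineer draws, and the score is computed on the rest of the population. The linear model and least-squares fit are just stand-ins for whatever you would actually do.

```python
import numpy as np

# A fixed, finite "population" of (feature, outcome) pairs. The numbers are
# synthetic; nothing below depends on how they were produced.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

# The engineer creates the randomness: a uniformly random training subset.
train_idx = rng.choice(len(X), size=200, replace=False)
test_idx = np.setdiff1d(np.arange(len(X)), train_idx)

# Fit by least squares on the sample (the "people like to assume linearity" part).
w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Evaluate on the remainder of the population.
mse = np.mean((X[test_idx] @ w - y[test_idx]) ** 2)
print(w, mse)
```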
Online Learning: You run through the data sequentially, make predictions, and update your models based on the accuracy of each sequential prediction. Online learning is fascinating because no randomness is required to derive theorems. It is fundamentally a non-stochastic theory. The key is to compare with the best predictions at the population level and to show that, on average, you match population-level performance as the sample grows. If there are patterns to recognize, your algorithm will recognize them. The Perceptron mistake bound is the classic example of online learning, but similar bounds can be derived for gradient methods on linear models and many other machine learning applications. The downside here is the clunky, unintuitive evaluation metric that is regret.
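For concreteness, here is a toy Perceptron run on a synthetic, linearly separable stream. The hidden separator exists only to manufacture labels for the demo; the algorithm itself just counts mistakes, which is the quantity the classic bound controls on any separable sequence, random or not.

```python
import numpy as np

# Build a synthetic linearly separable stream (the hidden separator is only
# used to generate labels for this demo).
rng = np.random.default_rng(1)
w_star = np.array([2.0, -1.0, 0.5])
stream = [(x, np.sign(x @ w_star)) for x in rng.normal(size=(500, 3))]

# Perceptron: predict, and update only when wrong.
w = np.zeros(3)
mistakes = 0
for x, y in stream:
    y_hat = 1.0 if x @ w >= 0 else -1.0
    if y_hat != y:
        w += y * x
        mistakes += 1

# On any linearly separable sequence, the number of mistakes is at most
# (R / margin)**2, with no distributional assumption on the data.
print(mistakes)
```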
Empiricist Learning: Your score is at the population level, but you can act on the population to make decisions. You assume that you have a mechanism for selecting individuals from the population and acting on them (e.g., choosing some representative individuals from the group for an experiment). Based on the selection mechanisms available, you can design algorithms with ex-ante guarantees about population performance. This was the model I described in Lecture 19 on adaptive experiment design. The next time I teach the class, I’m going to clarify that you can model batch learning this way too. More broadly, this perspective connects empirical risk minimization in machine learning to decision theory without Bayesian modeling. I don’t know what to call this perspective, which lies between the batch and online models, but it’s the one I’m most excited about. A statistically minded reader might call it “design-based” machine learning.
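Here is a small sketch of what I have in mind, with hypothetical responses: a fixed finite population of potential outcomes, a randomized selection of whom to act on, and a difference-in-means estimate whose justification comes entirely from the randomization the engineer designed.

```python
import random

# A fixed, finite population of hypothetical potential responses.
random.seed(0)
population = [{"if_treated": 1.0 + random.gauss(0, 0.5),
               "if_untreated": random.gauss(0, 0.5)} for _ in range(1000)]

# The only randomness is ours: whom we select and whom we treat.
selected = random.sample(range(len(population)), k=100)
treated = set(random.sample(selected, k=50))
control = [i for i in selected if i not in treated]

treated_mean = sum(population[i]["if_treated"] for i in treated) / len(treated)
control_mean = sum(population[i]["if_untreated"] for i in control) / len(control)

# The designed randomization is what makes this difference a reasonable
# estimate of the population-level effect.
print(treated_mean - control_mean)
```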
The data-generating distribution may be a convenient mnemonic crutch for machine learning engineers. But it’s not necessary to understand machine learning. We use the same methods regardless of whether the world is producing randomness. Why that is the case is interesting. Or at least, it’s interesting to me. To make sense of our overly data-driven culture, we should figure out when we actually need statistical models of data. In machine learning, the answer could be never.
From the outset of our collaboration, Moritz and I discussed teaching this machine learning course without data-generating distributions. It took us a few iterations of teaching the class to get there. Perhaps I can entice him to write a second edition of PPA where we finish the job.


Though imagining such stochastic processes helps you build useful simulators of data.
A common viewpoint is that there is a set of all possible data out there (e.g., all images on the web). We collect a subset of these data and train our model. The main (and likely unreasonable) assumption is that the training subset is an iid sample (or uniformly sampled w/o replacement) from the set of all possible data. This assumption is the central ingredient in theoretical generalization bounds. While it's probably not perfectly reasonable, it does give us a framework for comparing models and algorithms, and aligns with common practices like hold-out validation. I think this is a useful model of the (training) data-generating distribution.