On Twitter, Lily-Belle Sweet brought up a great question. How do we justify leave-one-out error, the holdout method, or cross-validation without an appeal to i.i.d.?

Here’s my take on this. When we collect a training data set, it serves as a population itself. If we subsample from this training data, then the i.i.d. assumption holds by construction, because we control the sampling. Hence, bootstrap methods are telling us something about internal validity. They are telling us how the method performs when the superpopulation is our training set. Then, to generalize beyond this to new data, we just have to convince ourselves that the training set is a representative sample of the data we’ll see.

Thoughts?
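To make the subsampling point concrete, here is a minimal numpy sketch on made-up data (everything below is synthetic and hypothetical, not from the thread): repeatedly splitting the training set treats it as the population, so the spread of the holdout scores speaks only to internal validity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "training population": 200 points from a noisy linear model.
n = 200
X = rng.normal(size=(n, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

def holdout_mse(X, y, rng, frac=0.8):
    """Fit least squares on a random split and return held-out MSE."""
    idx = rng.permutation(len(y))
    cut = int(frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    resid = y[te] - X[te] @ w
    return float(np.mean(resid ** 2))

# Resampling *within* the training set: every split is i.i.d. by construction,
# so the spread of these scores measures internal validity only.
scores = np.array([holdout_mse(X, y, rng) for _ in range(100)])
print(scores.mean(), scores.std())
```

Nothing in this procedure says anything about data drawn from outside the 200 points; that is exactly the external-validity gap discussed below.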

Great response, thanks! I agree that bootstrap methods tell us about internal validity. But this is never the question we're asking, so 'just' convincing ourselves that the training set is representative is actually a pretty big ask, no? What does 'representative' really mean here, and how can we convince ourselves of that?

This one I don't have a good answer to. But I'll say that this problem is pervasive in science! How do we know the results of a randomized clinical trial will be transportable to new patients? This is basically the same question, right?

Yep, big question - let me rephrase my thoughts a bit. Given that data isn't i.i.d., another way to assess generalisation performance is to account for that in your test data selection instead of bootstrapping, right? E.g., for temporal data: bootstrapping would be enough for measuring training set performance, but using a temporal split gives us a better measurement of generalisation (because the model may rely on the temporal autocorrelation). And of course there can be umpteen more confounders and all kinds of mess going on under the hood...

So if we want to generalise beyond the training set, isn't the question not 'is our data set representative?' but 'is our test set as difficult as unseen data could be'?
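A toy illustration of the temporal-split point (all data here is synthetic, and 1-nearest-neighbor-in-time stands in for any model that exploits autocorrelation): a predictor that just copies the nearest-in-time training value looks excellent under a shuffled split but degrades badly under a temporal split.

```python
import numpy as np

rng = np.random.default_rng(1)

# A slowly drifting AR(1) series: strong temporal autocorrelation.
T = 500
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = 0.99 * y[t - 1] + rng.normal(scale=0.1)
t_axis = np.arange(T, dtype=float)

def one_nn_mse(train_idx, test_idx):
    """Predict each test value by the training point nearest in time."""
    preds = []
    for t in test_idx:
        j = train_idx[np.argmin(np.abs(t_axis[train_idx] - t_axis[t]))]
        preds.append(y[j])
    return float(np.mean((y[test_idx] - np.array(preds)) ** 2))

idx = rng.permutation(T)
shuffled_mse = one_nn_mse(np.sort(idx[:400]), np.sort(idx[400:]))
temporal_mse = one_nn_mse(np.arange(400), np.arange(400, T))
print(shuffled_mse, temporal_mse)  # the temporal split is markedly worse
```

Under the shuffled split, every test point has a training neighbor one or two steps away; under the temporal split, the model is anchored to the last training point while the series drifts away from it.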

Totally agree with these examples. I think both questions are important and can't be ignored. But what is the theory that dictates how to do a good cross-validation split? Or what is the theory that tells us which features should be used for prediction? These are the sorts of questions I'm going to try to partially answer in the next couple of lectures. The answer has to involve bringing assembled knowledge (dare I say priors) to the data collection, measurement, and benchmarking. But I'm not convinced that any of these questions are answered by insights from what we typically construe as Mathematical Statistics.

The two questions you raise are exactly the topics I'm interested in, particularly in the case where we don't know the underlying causal process! And to your last point - I am fairly new to generalization theory but have had the same hunch, so it's validating to see that you are also not convinced.

"If our theory gives bad advice for practice, we have to come clean and admit our theory has failed."

I agree with this, of course. It is important to take the next step, though, and state that we need new theory, consistent with empirical evidence.

Btw, I also find uniform-type bounds quite beautiful. That is a big part of their appeal, I suppose.

VC theory is beautiful! But lots of physicists think String Theory is beautiful too.

That said, I'm all in for some new practically minded theory.

I have enjoyed reading "Lost in Math: How Beauty Leads Physics Astray" by Sabine Hossenfelder. It is interesting how even in physics certain ideas can continue to be pursued even when empirical validation is elusive.

If we do not develop theory, it is the end of science, I am afraid.

> The issue with a lot of statistical theory is that we take metaphysical beliefs (e.g., data is i.i.d.), turn these beliefs into mathematical axioms, and then forget that these beliefs weren't particularly grounded in the first place.

Reminds me of a quote I read yesterday: 'Everybody believes in the [normal approximation], the experimenters because they think it is a mathematical theorem, the mathematicians because they believe it is an experimental fact.' -- G. Lippmann

That’s a good one. There’s an awful lot of evidence that bell curves are paradigmatic rather than a natural consequence. This article is compelling: https://www.journals.uchicago.edu/doi/10.1093/bjps/axs046

As you know from our multiple conversations, I personally view the statistical approach as fundamentally broken. It's based on poor epistemology and poor understanding of probability (in particular, on the conflation of the object language of empirical averages and the metalanguage of expected values with respect to a probability measure). On the other hand, there is a sensible pragmatist defense, which requires a great deal of commitment to rigorous testing, verification, and, dare we say, maintenance. In the context of ML systems, maintenance means regularly checking whether the nominal environment for which the system has been designed (in our case, the "superpopulation") is still an adequate representation of the actual operating conditions. Am I going to quote Willems again? You bet I am:

"In engineering (and prescriptive aspects of economics) one can, it seems to me, take the following intermediate position. An algorithm-based engineering device, say in signal processing, communication, or control, comes with a set of ‘certificates’, i.e. statements that guarantee that the device or the algorithm will work well under certain specified circumstances. These circumstances need not be the ones under which the device will operate in actual practice. They may not even be circumstances which can happen in the real world. These certificates are merely quality specifications. Examples of such performance guarantees may be that an error correcting code corrects an encoded message that is received with on the average not more than a certain percentage of errors, or that a filter generates the conditional expectation of an unobserved signal from an observed one under certain prescribed stochastic assumptions, or that a controller ensures robust stability if the plant is in a certain neighborhood of a nominal one, etc."

Max, take this last bit:

"Examples of such performance guarantees may be that an error correcting code corrects an encoded message that is received with on the average not more than a certain percentage of errors, or that a filter generates the conditional expectation of an unobserved signal from an observed one under certain prescribed stochastic assumptions, or that a controller ensures robust stability if the plant is in a certain neighborhood of a nominal one, etc."

and tell me what you think Willems might say about machine learning?

My take is that he would probably suggest the use of certificates based on something like this: https://arxiv.org/abs/1009.0679 (I liked that paper when it came out, it is probably worth revisiting).

Also, going along with your prescription to start with linear models to understand what the hell is going on, this book by Willems and collaborators seems like something that we should pay close attention to:

https://www.amazon.com/Exact-Approximate-Modeling-Linear-Systems/dp/0898716039

"The topic of this book is fitting models to data. We would like the model to fit the data exactly; however, in practice often the best that can be achieved is only an approximate fit. A fundamental question in approximate modeling is how to quantify the lack of fit between the data and the model. In this chapter, we explain and illustrate two different approaches for answering this question.

The first one, called latency, augments the model with additional unobserved variables that allow the augmented model to fit the data exactly. Many classical approximate modeling techniques such as the least squares and autoregressive moving average exogenous (ARMAX) system identification methods are latency oriented methods. The statistical tool corresponding to the latency approach is regression.

An alternative approach, called misfit, resolves the data-model mismatch by correcting the data, so that it fits the model exactly. The main example of the misfit approach is the total least squares method and the corresponding statistical tool is errors-in-variables regression."
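The latency/misfit distinction quoted above can be seen in a few lines of numpy (synthetic errors-in-variables data; this is an illustration of the two viewpoints, not an example from the book): ordinary least squares pushes all the mismatch into an unobserved residual on y and attenuates the slope when x is noisy, while total least squares minimally perturbs the data matrix itself so an exact linear relation holds.

```python
import numpy as np

rng = np.random.default_rng(2)

# True relation y = 2x, but noise enters BOTH coordinates.
n = 300
x_true = rng.uniform(-1, 1, n)
x = x_true + rng.normal(scale=0.2, size=n)
y = 2.0 * x_true + rng.normal(scale=0.2, size=n)
xc, yc = x - x.mean(), y - y.mean()

# Latency view: OLS absorbs all misfit into a residual on y.
# With noisy x, the estimated slope is attenuated below 2.
w_ols = float(np.sum(xc * yc) / np.sum(xc * xc))

# Misfit view: TLS corrects the data matrix [x y] minimally so that an
# exact linear relation holds; the fit is the smallest right singular vector.
A = np.column_stack([xc, yc])
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v = Vt[-1]                      # normal vector of the fitted line
w_tls = float(-v[0] / v[1])

print(w_ols, w_tls)  # OLS slope sits below the TLS slope
```

With equal noise in both coordinates, the misfit (TLS) estimate is the consistent one here; the latency (OLS) estimate suffers the classic attenuation bias.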

I think steps 1-3 that I propose (in the next blog here: https://argmin.substack.com/p/features-of-the-foundations-of-machine ) would be in line with the Willems ethos. I.e., treating the data as a superpopulation. What do you think?

I am inclined to agree.

Yeah. Same.

With regards to model fitting, there's a different question when fitting a dynamical system for control vs fitting a convnet for discriminating dog breeds. That is, we know we just need to fit the dynamical system well enough that it can be robustly controlled. There's a clean level of UQ that's acceptable.

Anyway, now I have to get another Willems book.

Glad to see that you share the same grievance as I do. These problems deserve much more attention than they receive right now.

I feel the issue with generalization theory is the obsession with universality and minimal assumptions. There is something unreasonably appealing about having a theory that applies to any learnable problem one can come up with (e.g., VC theory), but such a theory necessarily comes with pessimism, which leads to all sorts of vacuous/impossibility results. However, compared to the space of all problems, the space of problems produced by nature is extremely small and highly structured, and there clearly exist algorithms that generalize well on these problems despite the discouraging results from theory (a roundabout way of saying that neural networks are not meant to work on every problem, but they work very well on the problems we care about).

Personally, I believe the path forward for generalization theory is to make it closer to natural science. First, we make assumptions (maybe strong ones) about the data and then verify that these assumptions are consistent with nature. Then we prove theorems with these stronger assumptions, which hopefully would give us more useful conclusions. Unfortunately, this will make the theory less universal, but that may be a price worth paying for theories that are more reflective of reality. Do you think this is a reasonable perspective?

What is the core obstruction to formulating practically useful generalization theorems? It seems we could do better if we understood the relationship between the data we have today and the data we expect to have tomorrow. We have two default models of how past and future data are related: i.i.d. and small distribution shift. Are there any promising alternatives?

Or are you arguing that "even if i.i.d. were true, we still can't explain generalization due to weird phenomena like interpolation on the training set with number of parameters >> number of samples"? I don't buy that this is the issue. If we're in the i.i.d. setting, a model with a small Lipschitz constant + a sufficiently dense cover of the sample space will generalize well. Do you have any references that show a mapping with a huge Lipschitz constant but a poor cover of the sample space still generalizes well? That would be more convincing.
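The "small Lipschitz constant + dense cover" claim can be checked on a toy 1-D problem (synthetic; a 1-nearest-neighbor interpolant stands in for an interpolating fit): if the target g is 1-Lipschitz and every test point lies within r of a training point, the interpolant's error is at most 1·r.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: g = sin is 1-Lipschitz on [0, 2*pi].
g = np.sin
x_train = np.sort(rng.uniform(0, 2 * np.pi, 50))
y_train = g(x_train)

# 1-NN interpolant: exact on training points; its error at any test point x
# is |g(x_nn) - g(x)| <= L * |x_nn - x| <= L * r.
def predict(x):
    return y_train[np.argmin(np.abs(x_train - x[:, None]), axis=1)]

x_test = rng.uniform(0, 2 * np.pi, 1000)
err = np.abs(predict(x_test) - g(x_test))

# Covering radius r: the largest distance from a test point to its nearest
# training point.
r = float(np.max(np.min(np.abs(x_train - x_test[:, None]), axis=1)))
print(err.max(), r)  # err.max() <= 1 * r
```

The catch, as the replies below stress, is whether a dense cover is achievable at all once the dimension grows.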

What does it mean to cover the sample space? On MNIST, there are 6,000 images per class. Why is 6,000 sufficient to cover a 784-dimensional space well? What would the Lipschitz constant need to be here in order to get around some curse of dimensionality?

Moving beyond MNIST only makes matters worse. In the ImageNet benchmark there are 1,000 examples per class. Each image has hundreds of thousands of pixels. In what way do we cover the space?
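A quick numerical sense of why "covering" is hopeless in high dimension (uniform synthetic points, not MNIST pixels): nearest-neighbor distances are tiny in 2-D but enormous, and tightly concentrated, in 784-D.

```python
import numpy as np

rng = np.random.default_rng(4)

def median_nn_distance(n, d):
    """Median distance to the nearest neighbor among n uniform points in [0,1]^d."""
    X = rng.uniform(size=(n, d))
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(D2, np.inf)
    return float(np.median(np.sqrt(np.maximum(D2.min(axis=1), 0.0))))

dists = {d: median_nn_distance(500, d) for d in (2, 10, 100, 784)}
print(dists)  # grows from roughly 0.02 in 2-D to order 10 in 784-D
```

In 784 dimensions the nearest neighbor is nearly as far away as a typical random point, so "every test point is near a training point" cannot hold for uniform data at these sample sizes; any covering argument for image benchmarks has to lean on low-dimensional structure instead.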

I think the margin arguments are more compelling, but even there, the argument is that if you pick your features right, you'll have large margin. In fact, a large-margin perspective argues for picking high-dimensional representations, as data becomes more separated in higher dimensions. But it's not particularly prescriptive about which transformations get you a better margin.
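The "higher dimensions separate data" point in miniature (the classic XOR example, all synthetic): no direction in the raw 2-D space achieves a positive margin, but adding a single product feature yields margin 1.

```python
import numpy as np

rng = np.random.default_rng(5)

# XOR: labels are the product of the coordinates.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = X[:, 0] * X[:, 1]

# In the raw 2-D space, every direction misclassifies some point:
# min_i y_i <w, x_i> <= 0 for all w (checked here by brute force).
best_raw = max(float(np.min(y * (X @ w))) for w in rng.normal(size=(2000, 2)))

# Lift with one extra product feature and the margin jumps to 1.
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])
w = np.array([0.0, 0.0, 1.0])
margin = float(np.min(y * (Phi @ w)) / np.linalg.norm(w))

print(best_raw, margin)  # best_raw <= 0, margin == 1.0
```

The example shows that the *right* lift creates margin; it says nothing about how to find that lift, which is exactly the non-prescriptiveness complained about above.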

In the latter part of my comment, I should have stated, "Supposing a superpopulation exists and we can sample i.i.d. from it," which, as you point out, is doubtful with MNIST and other benchmarks. But even in MNIST, the exact dimension of the space appears irrelevant if we know a priori that most elements of the test set are near a training example and the mapping we fit has a small (local) Lipschitz constant near each sample (+interpolation). (Incidentally, both the uniform convergence and the stability approaches are "Lipschitz => behavior in training sample predicts behavior out of sample" results. In the former case, we need a cover of the space of samples, and there, you run into the curse of dimensionality. In the latter case, stability is a kind of Lipschitz continuity on the space of sequences with a discrete metric that takes on value infinity for sequences that differ in more than one element.)

Returning to my original question: do you buy the statement that "we can only provide practically useful theorems if we understand the relationship between past and future data?" If so, when, if ever, do we understand this relationship? And when we admit that we don't, what do we do instead? Fit 1000 models, pick the best, and admit that we have no guarantees and are likely never to produce one?

Glad your students and I agree you're too harsh about this.

What's your positive case for the theory? It has to be more than "bigger n is better."

it tells you that you should split into training and test :)

LOLOL. But if you click the link, Bill Highleyman figured that out with the wrong theory. https://ieeexplore.ieee.org/document/6768949

I assume you’ve both seen Chervonenkis’ slightly risqué comment on the nature of Highleyman’s mistake?

you assume wrong

pass it on

https://link.springer.com/chapter/10.1007/978-3-319-21852-6_1

ctrl-f Highleyman