The graph looks intuitive to engineers like me since we are used to facing trade-offs in almost every physical problem. For me, the bias-variance trade-off appears as a parameter estimation problem: given finite data, it is “more difficult” to reliably estimate many parameters than just a few. Therefore, Advice 1: keep the number of parameters bounded above. With very few parameters we can estimate them reliably, but the model will be crude and not very useful. Hence, Advice 2: keep a lower bound on the number of parameters as well. The literature on AIC (Akaike’s information criterion) and BIC for model order selection tries to make this choice systematically for us. These folks received awards for their work, and it was our religion in engineering.
I believe our mental shortcoming stems from treating the small-data regime as the only possible regime of operation. The main issue, especially in physical problems, has almost always been making the best use of very limited data. The abundance of training data has led to interpolation-type results (with or without noise) and to this new school, or church!
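A minimal sketch of the kind of AIC/BIC model-order selection mentioned above, for a toy polynomial regression with Gaussian noise (the setup, constants, and grid of orders are my own, not anyone's canonical recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.standard_normal(n)  # true order is 2

for order in range(8):
    X = np.vander(x, order + 1)                    # polynomial design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    k = order + 2                                  # coefficients plus the noise variance
    # For Gaussian noise at the MLE, -2 log L = n * log(rss / n) + constant
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    print(f"order {order}: AIC = {aic:8.2f}  BIC = {bic:8.2f}")
```

Both criteria trade goodness of fit against a parameter-count penalty, which is exactly the bounded-above, bounded-below advice in the comment.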
Yes, the issue is that Occam's Razor is not a universally useful engineering heuristic.
the easiest way to see the problem with this figure is being forced to teach it. students (in my class at least) would immediately see it was bs and ask the obvious questions about model complexity that you listed here. I was always left feeling deeply unsatisfied with teaching this lecture and the many after where we reasoned about whether the bias or variance would increase / decrease if we adjusted some hyperparameter \lambda.
i wasn't using ESL btw, but ISL, since it was an undergrad course. i inherited the course and taught it for several years. luckily i will not teach that course again :p.
YES exactly. The plot is in ISL too. I first taught ML in fall of 2016, and the next five years of my research career was writing papers about why that plot (and a lot of other stuff in ISL) is not only wrong, but bad advice.
You and I should find a way to design this class together: https://www.argmin.net/p/machine-learning-101
i'm in. i was going to suggest something similar after reading your post.
i really like starting from the assumption that pattern recognition works (and optimization works -- in fact that's the first line of my course overview this semester). can get to cool stuff quickly and not feel like you're wasting everyones' time!
ooh, what do you mean by "optimization works?" I'm intrigued.
i'll try not to respond in essay form, but i'm leaving myself open to infinite caveats.
the short answer: in ml, gradient descent—and related methods—plus a bit of dynamic programming are all you need. (didn't you write a blog post on this?) if hyperparameter tuning and multiple random initializations fail to find minima, then either your problem is f-hard (perhaps due to missing critical info [0]) or it's been formulated incorrectly. i realize this is a tautological explanation of an empirical observation.
my course takes this viewpoint seriously. if it's true, students need to know which building blocks yield 'solvable' formulations, how to implement them in code, run the code efficiently, diagnose failures, and improve the methods. we've roughly got answers to these questions in ml, so it's a good time to write them down.
[0] one interesting example is holographic phase retrieval (https://arxiv.org/pdf/1901.06453). phase retrieval with fourier measurements is hard, but adding a small reference makes the problem tractable. you can even implement this experimentally and recover the signal with gradient descent (though the paper’s linear algebra method is brittle).
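To make the "hyperparameter tuning plus multiple random initializations" recipe concrete, here is a tiny sketch on a toy nonconvex objective of my own invention (nothing to do with the phase-retrieval example in the footnote):

```python
import numpy as np

def loss(w):
    # toy nonconvex objective with several local minima
    return np.sin(3 * w) + 0.1 * w**2

def grad(w):
    return 3 * np.cos(3 * w) + 0.2 * w

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

rng = np.random.default_rng(0)
candidates = [gradient_descent(w0) for w0 in rng.uniform(-5, 5, size=20)]  # random restarts
best = min(candidates, key=loss)
print(f"best w = {best:.3f}, loss = {loss(best):.3f}")
```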
Got it. I'm down with this definition.
It's really hard to pin down hardness in nonlinear programming. One of my graduate students is working on an optimization formulation that is much hairier than we expected.
However, we don't believe in the reality of models in machine learning. So you have a lot more flexibility than you do in other optimization contexts.
Maybe it goes something like this in undergrad ML:
- There are a lot of optimization methods out there, but SGD works for 95% of the problems in ML.
- Here are tips and tricks to get SGD to work for you (a minimal sketch follows this list).
- If SGD doesn't work on your problem, you might have to adjust your problem to make SGD happier.
- If you really want to learn how to move beyond SGD, here's the number of the course to take after ML.
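As a placeholder for the "tips and tricks" bullet, here is a minimal minibatch-SGD loop for least squares; the step size, batch size, and quadratic objective are all illustrative assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
lr, batch_size = 0.01, 32
for epoch in range(50):
    # shuffle each epoch, then sweep over minibatches
    for idx in np.array_split(rng.permutation(n), n // batch_size):
        Xb, yb = X[idx], y[idx]
        g = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of the minibatch MSE
        w -= lr * g

print("final training MSE:", np.mean((X @ w - y) ** 2))
```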
yeah, I agree with this. would be curious to hear about the nonlinear programming problem as well
Hi Ben, how do you obtain more parameters than examples in your experiment? It seems to me that if you are using a linear model, that would imply that you also need more features. Are you using the same random Fourier features scheme as in Belkin et al., or is there something else going on here? Or am I off completely? Thank you!
The plots that I made use a model from this paper on double descent:
https://arxiv.org/abs/1903.07571
y = sum_k x_k beta_k + e
All of the x_k are normal with mean zero and variance 1. beta_k = 1/k. e is normal with mean zero and variance 0.25.
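A minimal sketch of that setup as I read it: least squares on the first p features, which np.linalg.lstsq solves with the minimum-norm solution once p exceeds the number of training examples. The choices of n, the total feature count, and the grid of p are mine, so this is a rough reproduction rather than the exact figure:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d, noise_var=0.25):
    # x_k ~ N(0, 1), beta_k = 1/k, e ~ N(0, 0.25), as described above
    X = rng.standard_normal((n, d))
    beta = 1.0 / np.arange(1, d + 1)
    y = X @ beta + np.sqrt(noise_var) * rng.standard_normal(n)
    return X, y

n_train, d_total = 40, 400
X_train, y_train = make_data(n_train, d_total)
X_test, y_test = make_data(2000, d_total)

for p in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
    # fit using only the first p features
    coef, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    test_mse = np.mean((X_test[:, :p] @ coef - y_test) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")
```

So "more parameters than examples" just means taking p larger than n_train; the features are already there, and p is the knob. Test error typically spikes near p = n_train and then comes back down as p keeps growing.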
In a more recent installment of another Hastie, Tibshirani, et al. product (ISLP), they acknowledge double descent but still defend the value of the bias-variance trade-off.
From chapter 10.8 of ISLP:
"To summarize: though double descent can sometimes occur in neural networks, we typically do not want to rely on this behavior. Moreover, it is important to remember that the bias-variance trade-off always holds (though it is possible that test error as a function of flexibility may not exhibit a U-shape, depending on how we have parametrized the notion of “flexibility” on the x-axis)."
It is probably fair to ask what good such advice is, when one doesn't know how to parametrize the x-axis.
oh no! you might say they are "doubling down."
“On a plot that radicalized me like no other” - what humility from a great researcher and educator!
I think, sadly, this fiction will endure for a while longer. The problem for many is that tricks for stabilizing optimization, or for somehow finding the low-estimation-error solution, seem like “regularization,” which people assume is synonymous with lower complexity.
You can’t destroy faith without giving an alternative explanation of empirical practice ;)
Two classic problems here which led us astray for a bit - bad analogies and logical fallacies:
Let us eschew the terms bias and variance for a bit and call them approximation and estimation error. This lays bare the fact that there needn’t be a tradeoff. Approximation error is clearly limited by the richness of the function class (irreducible), but estimation error is a function of both the model class AND the fitting procedure - there is no reason why estimation error must increase with whatever notion of complexity we invent.
The theory is not wrong - it is merely pessimistic. In fact, it says that IF you have a simple hypothesis class and a large number of i.i.d. examples that both train and test are taken from, THEN you can make the test error close to the train error when some notion of complexity is small. BUT we are *denying the antecedent* if we misread it to mean that we absolutely cannot have low estimation error with complex models. Classic logical fallacy!
And like Ben said we could even believe that the bound is _tight_ if we denied the existence of holdout!
>> “Let us eschew the terms bias and variance for a bit and call them approximation and estimation error.” <<
These are not equivalent, so you can’t just call them that.
Bias is a well defined (if overloaded) term, as is variance, as are approximation and estimation error. But they’re not the same thing.
By the way - estimation error is not a function of the fitting procedure - that’s “optimization error”. The estimation error is the difference of risks for the empirical risk minimizer versus the best-in-family model.
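For concreteness, here is the three-way decomposition these terms usually come from (in the spirit of Bottou and Bousquet), writing f* for the Bayes predictor, f*_F for the best model in the class, f-hat_n for the empirical risk minimizer, and f-tilde_n for whatever the optimizer actually returns:

```latex
\underbrace{R(\tilde{f}_n) - R(f^*)}_{\text{excess risk}}
  = \underbrace{R(f^*_{\mathcal{F}}) - R(f^*)}_{\text{approximation}}
  + \underbrace{R(\hat{f}_n) - R(f^*_{\mathcal{F}})}_{\text{estimation}}
  + \underbrace{R(\tilde{f}_n) - R(\hat{f}_n)}_{\text{optimization}}
```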
It's a brainworm!
-----
The origin of every brainworm? It used to be a heuristic, or a toy, or just a trick we did. Somehow, it was really intellectually satisfying (beautiful, simple, compelling, etc.). Also, it worked really well sometimes! So we started teaching it, and somehow the “it’s just a heuristic/toy/trick/thing” component gets lost in teaching the beauty/simplicity/etc.
Then, becomes dogma.
I have forever hated the lack of a formal definition of model complexity in HTF’s book. As an information theorist, it’s particularly hard to process this informality. That said, I still teach it in the context of linear regression. I use it to highlight the importance of test data as distinct from training data, and even the need for validation as model complexity (say, the number of features used, controlled via a regularizer) grows. We really need updated books, or ways to incorporate complex papers in simple ways.
Agreed... the problem is that machine learning is so decoupled from our theories that it's hard to write a satisfactory text.
Hmmm. This post started off seeming like it was going to rock my world, "everything you thought about Bias-Variance Tradeoff was wrong!!!" But then it just points at Double Descent, which we all knew about going in. Nothing wrong with this post but I was left a bit underwhelmed :D
Some sort of regularization is happening in the second figure when the model complexity is large. Since the solution is not unique, probably a minimum norm solution is chosen or something like that?
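A quick numerical sketch of what "minimum-norm solution" means in the overparameterized case, assuming the fit is done with the pseudoinverse (I don't know exactly how the post's figure was produced):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                      # more parameters than examples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_min = np.linalg.pinv(X) @ y   # minimum-norm interpolating solution

# adding any null-space direction of X gives another interpolating solution
null_dir = (np.eye(p) - np.linalg.pinv(X) @ X)[:, 0]
beta_other = beta_min + null_dir

print(np.allclose(X @ beta_min, y), np.allclose(X @ beta_other, y))  # both fit the training data exactly
print(np.linalg.norm(beta_min), np.linalg.norm(beta_other))          # but the first has the smallest norm
```

Gradient descent initialized at zero on the least-squares objective also converges to this minimum-norm interpolant, which is one sense in which "some sort of regularization" happens implicitly.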
What is regularization?
I think it highly depends how you define model complexity. My best ranking system has over 175 features / factors, since our ranking systems do not have a normal distribution assumption and capture non linearity very well, it works super robust (especially on small caps, you need a lot of features, because you got a lot of NA Data, so to fish your net needs to be big!)... Best Regards Andreas
The confusion is due to the overloading of terms: models and parameters.
When the concept of bias/variance was introduced by statisticians, they all agreed on formal definitions of those terms.
Model: was the set of generative distributions under consideration.
Parameter: was a multivariable index of the distributions in the model.
For example, one model would be univariate Gaussian (with 2 parameters), another model would be bivariate Gaussian (with 5 parameters).
With those definitions, everything works out perfectly, in the math and in practice.
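Spelling the example out in the statisticians' notation (just a transcription of the comment, nothing more):

```latex
% univariate Gaussian model: 2 parameters
\mathcal{M}_1 = \{\, \mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 > 0 \,\}
% bivariate Gaussian model: 5 parameters (2 means, 3 free entries of the symmetric covariance)
\mathcal{M}_2 = \{\, \mathcal{N}(\boldsymbol{\mu}, \Sigma) : \boldsymbol{\mu} \in \mathbb{R}^2,\ \Sigma \succ 0 \text{ symmetric} \,\}
```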
Then, ML people overloaded the terms.
Model: any function mapping from input to output.
Parameter: any knob one can tune in a model, e.g., the number of hidden units.
Once the vocabulary is muddied, fundamental truths become confusing.
But they are still true.
I'm sorry, I don't understand your point. Which theorem is true in your Gaussian example?
I'm a novice in machine learning. If not ESL, which book should I opt for to start my learning?
I get why you say this in regard to deep learning, where most theory breaks down, but why is it not OK to teach for decision trees, k-NN, logistic regression, all the things a good data scientist should have in their toolkit? Double descent is actually pretty hard to make appear in linear regression without synthetic data in high dimensions.
I don't know how to respond. You've posted three vague comments on this post, but I'm not sure what point you are trying to argue. Perhaps you could write up a longer response so that it's clearer what you mean.
Sorry. Ok - for this post, I meant that the BV trade off still holds for trees and k-nearest neighbour models, so it’s still worth teaching. Obviously much more controversial for neural nets.
On the other post, I thought you were implying we MUST use squared loss to have a decomposition. This is not true, as it holds for many losses - not just squared - e.g. cross-entropy, and more generally any Bregman divergence.
On the other other post (sorry again) I was responding to the commenter, as he was saying we can “call them” approx/estim. That’s just not correct either - https://openreview.net/pdf?id=4TnFbv16hK
I think more generally, the point about the ill-defined “complexity” on the x-axis is absolutely valid. However with approximation-estimation, the x-axis is not complexity, but the size of the hypothesis class - which are related of course, but a small sized class, full of complex hypotheses, is entirely possible. And in fact that’s exactly what I think happens in deep learning - the implicit regularisation of SGD means the effective size is smaller, though the members of that hypothesis class are representationally complex.
So, if I were to argue any point, these decompositions have a role, but we don’t know how to measure the effective size of a hypothesis class in deep learning.
My apologies for being vague - not used to this platform and was treating it like Twitter. Really like your posts on both anyhow.
Thanks and no worries! What I like about Substack is that we can have rather detailed and lengthy conversations without having to worry about Twitter-esque concept collapse. A longer conversation is needed about how I’m celebrating a return to blog comments in 2025.
I agree that the way you’ve defined the estimation-approximation tradeoff in your paper is not the same as the bias-variance tradeoff. And I prefer the E-A decomposition as it doesn’t require any assumptions at all.
But, I’ve stared at conclusions derived from E-A decompositions for over 20 years, and I have come away dissatisfied. Not only do I think they don’t tell us much about deep learning, but I don’t think they tell us much about “shallow” learning either. For estimation error, the conclusion is rarely deeper than “n should be as large as possible” and “you should pick a model class where your margin is large.” Moreover, I have no idea what we can ever say about approximation error in any applied, purely data-driven setting. (For whatever it’s worth, Vapnik also holds a very dim view of estimating approximation error.)
In any event, I see what you’re saying, and I understand that these decompositions have helped clarify theory questions. However, I haven’t personally found an instance where they’ve helped me understand how to do anything. More troubling, when I’ve heeded the advice derived from these bounds, I ended up sacrificing predictive performance.
Indeed - I now see how it works and recognise that “concept collapse” has happened there before ;-)
Well - I guess if you’re saying the BV or AE decompositions don’t give immediate and ongoing practical guidance, then yes, that’s true. But surely we can say they are good conceptual tools, that have in the past provided a thought framework which inspired certain concrete actions? Where did the idea of L2 regularisation come from historically? How did Hinton come up with dropout? Why did Breiman add that extra stochastic element to Bagging, making it into Random Forests? I’d say with all of those, the inventor was considering variance reduction. To know that variance reduction is a good thing, we need the bias/variance decomposition.
Perhaps they will inspire future innovations just as a “way to think”. Maybe the literature around flat minima can come up with a way to estimate how many there are, and thus provide some guidance on architectures that will reduce the effective size of the hypothesis class.
Lots of maybes, I know ;-). All that said, I do just like these sorts of decompositions for their beautiful (if massively simplified) treatment of a complex phenomenon.
Maybe the decompositions can be allowed in a syllabus just because they’re beautiful - is a theory more likely to be correct if it is beautiful? (quote paraphrased from Murray Gell-Mann)
We'd all like beautiful theories to be true, but history is littered with a lot of beauty that didn't pan out. We muddle through regardless!
But wouldn’t you agree that bias/variance gave a thought framework on which Random Forests and Dropout were built? So it is a useful tool in that sense.
I’m not (yet) disagreeing with the premise of this post, but….
>> “First, we must measure prediction error in the square loss.” <<
Bias/Variance holds for many losses beyond just squared loss.
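For reference, here is the squared-loss version of the decomposition under discussion, at a fixed input x, writing f-bar(x) for the fit averaged over training sets D and sigma^2 for the noise variance; as the comment says, analogous decompositions exist for cross-entropy and other Bregman divergences:

```latex
\mathbb{E}_{D,\,y}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[y \mid x] - \bar{f}(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \bar{f}(x)\big)^2\right]}_{\text{variance}}
```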
Isn’t Vapnik–Chervonenkis dimension a good general measure of complexity? What happens when it is used on the x-axis?