> When the future turns out not to be like the past, machine learning can’t work!
Yes, but what about when the future *is* like the past, but performance is still bad? Maybe this should be the definition of overfitting. Great post.
This reminds me of another problematic definition: "transfer learning." An NSF grant on transfer learning funded my PhD studies, but I still haven't figured out what it means.
Same!
I think most of these general definitions fail to focus on why overfitting occurs--which imo is because an overly complex model tries to interpolate the noise. Ignoring confounding aspects (which could lead to false generalization in some sense), an example would be if the underlying data-generating process were a linear model with iid noise, but the fitted model were some DNN.
My issue is that this problem is solved by the holdout method. And no one does machine learning without a test set.
Also, if you use a test set, a DNN will fit a linear model with iid noise.
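To make that concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the sizes and names like `w_true` are illustrative choices, not anything from the post): generate data from a linear model with iid noise, fit an intentionally oversized MLP with a held-out validation split for early stopping, and compare its test error to a plain linear fit and to the noise floor.

```python
# Minimal sketch: data come from a linear model with iid Gaussian noise, and an
# oversized MLP is fit with a held-out validation split for early stopping. The
# point is to compare held-out error against the noise floor and a linear fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, d, noise_std = 2000, 10, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + noise_std * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(256, 256), early_stopping=True,
                   validation_fraction=0.2, max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)

def test_mse(model):
    return np.mean((model.predict(X_te) - y_te) ** 2)

print("noise floor (sigma^2):", noise_std ** 2)
print("linear test MSE:      ", test_mse(linear))
print("MLP test MSE:         ", test_mse(mlp))
```

With the holdout doing the model selection, the big model's test error typically lands near the noise variance rather than blowing up.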
Sure, those are means to solve/reduce overfitting for those scenarios, but that doesn't mean that definition of overfitting fails to capture those scenarios in the first place.
Defining what is overfitting and describing how to address it are separate tasks!
It feels like your primary problem is with the simplistic linguistics or "rules of thumb" that were probably designed to cater to even the bottom quintile of students/practitioners. But maybe this shows up as problematic dogma more in your surroundings than in mine. In my experience, most people know that poor generalization performance isn't obviously or directly due to one of 10 things.
What are the more common factors people tend to attribute poor generalization to in your experience?
In grad school I was constructing RBF kernels on 50,000 features for < 250 examples and there were always people shouting CuRSe oF DimEnSioNALiTy!!!!11! I just kind of ignored them. The kernels had eigengaps and decaying spectra and that was fine for me.
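For what it's worth, here is a rough sketch of the kind of sanity check being described (numpy only; the sizes mirror the comment, but the data-generating choices and the median-bandwidth heuristic are mine): 250 examples living in 50,000 ambient dimensions but with low intrinsic dimension, where the RBF kernel's spectrum decays quickly despite n << d.

```python
# Minimal sketch: 250 examples in 50,000 ambient dimensions with low intrinsic
# dimension. The RBF kernel matrix is built with the median-distance heuristic,
# and we inspect how fast its eigenvalues decay.
import numpy as np

rng = np.random.default_rng(0)
n, d, intrinsic = 250, 50_000, 5

Z = rng.normal(size=(n, intrinsic))              # low-dimensional latent structure
A = rng.normal(size=(intrinsic, d)) / np.sqrt(intrinsic)
X = Z @ A + 0.01 * rng.normal(size=(n, d))       # embedded in 50k dims with small noise

sq_norms = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
bandwidth2 = np.median(sq_dists[sq_dists > 0])   # median heuristic for the length scale
K = np.exp(-sq_dists / bandwidth2)

eigvals = np.linalg.eigvalsh(K)[::-1]            # descending order
print("top 10 eigenvalues:", np.round(eigvals[:10], 3))
print("fraction of trace in top 10:", eigvals[:10].sum() / eigvals.sum())
```

If the eigenvalues decay and there is a visible gap, the effective complexity is governed by the spectrum, not by the ambient feature count.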
The one-dimensional sinewave classifier (infinite VC dimension with one parameter!) is my favorite thought experiment for why counting parameters is pointless.
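For anyone who hasn't seen it, here is a small numerical version of that thought experiment (numpy; the specific points and the search grid are my choices): a brute-force search over the single parameter theta of h_theta(x) = sign(sin(theta * x)) realizes every labeling of the four points 10^-1, ..., 10^-4, and the same construction extends to any number of points, which is why the VC dimension is infinite.

```python
# Minimal sketch: the one-parameter classifier sign(sin(theta * x)) shatters the
# four points 1e-1, ..., 1e-4. We sweep theta over one period of the slowest
# sine and check that every one of the 2^4 labelings is realized.
import itertools
import numpy as np

x = np.array([1e-1, 1e-2, 1e-3, 1e-4])
thetas = np.arange(0.5, 2 * np.pi * 1e4, 1.0)      # covers a full period of sin(theta * 1e-4)
signs = np.sign(np.sin(np.outer(thetas, x)))       # shape (len(thetas), 4)

for labels in itertools.product([-1.0, 1.0], repeat=4):
    realizable = np.any(np.all(signs == np.array(labels), axis=1))
    print(labels, "realizable:", realizable)
```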
Since this became the topic of the day I decided to pull Tom Mitchell's book off my shelf to see if he has offered a definition. He does and it's mostly a formalized version of what Charles said:
"Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H, such that h has smaller error than h' on the training examples, but h' has smaller error than h over the entire distribution of instances."
So yeah, use a validation set to pick the best hypothesis.
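In symbols (notation mine, not Mitchell's), the definition says h overfits whenever some competitor h' looks strictly worse on the training sample S but strictly better on the instance distribution D:

```latex
h \text{ overfits the training data}
\iff
\exists\, h' \in H:\;
\widehat{\mathrm{err}}_S(h) < \widehat{\mathrm{err}}_S(h')
\;\text{ and }\;
\mathrm{err}_{\mathcal{D}}(h') < \mathrm{err}_{\mathcal{D}}(h)
```

Here err-hat_S is empirical error on the training sample S and err_D is error over the distribution D. Note both inequalities are strict, which is exactly what the comment below about h and h' having equal training error pokes at.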
I'll also add that the nice thing about working in robotics is that we always generate new test sets, i.e., whenever the robot actually does something, so I'm never really concerned with training for the test/validation set and the other worries folks bring up about trusting cross validation.
My issue with Mitchell's definition of overfitting is that you can have h and h' having *equal* train error and different test error.
I agree with you about robotics, of course. Do you do a lot of machine learning for robotics? If so, I'm curious what your evaluation practices are.
Yeah, most of my work is learning-based and usually coupled with other non-learning components. My preferred evaluation criterion is to have some clearly stated task-success test for which we can say either it succeeded or it didn't, e.g., the robot lifted the object and placed it on the desired shelf.
The difficult part is coming up with a meaningful set of conditions to show the desired generalization to things like novel object instances or environments not seen during training. It's something I'm interested in formalizing more. I think providing verification is one of the most important problems for future acceptance of robots with learned components.
I gave up on this fight some years ago. Nevertheless, good luck!
One of the last things Partha ever said to me was "though it is a wholly empirical field, machine learning is remarkably impervious to empiricism."
That sounds so much like Partha..
I mostly agree with you, but I still think that in very low signal-to-noise environments, specifically in trading, a lot of these still apply--particularly when you are predicting on a longer time horizon and thus have fewer samples. HFTs and other pods predicting at shorter time horizons are moving away from this convention, though; they're increasingly demonstrating success in applying deep learning to asset price prediction.
Every time I mention this topic, a common rebuttal is financial data. Unfortunately, I have no way of evaluating such assertions, as trading data is always private.
It's just gambling and rent extraction anyway. Pure waste of cognitive investment to worry about concerns brought up by folks whose application is predicting financial markets.
I love your blog, but I think this post is totally off base. Much of the success of deep learning has been discovering tricks (e.g., early stopping, dropout, SGD, convolutional and other forms of regularizing structure) to overcome overfitting. You might say "overfitting doesn't exist", but (I would argue) that's primarily because researchers have found ways to overcome it!
Consider a simple example: model a density with a collection of delta functions, one for each datapoint in the training set. Are you going to say that such a model *isn't* overfit? (It's a cop out to say "no one would ever do this". Of course they wouldn't — precisely because such a model would suffer from extreme overfitting!)
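If it helps, here is a minimal sketch of that example in the smoothed limit (numpy and scipy; the sample sizes and bandwidths are arbitrary illustrative choices): a Gaussian KDE with bandwidth h -> 0 approaches the one-delta-per-training-point density, and the training and held-out log-likelihoods pull apart accordingly.

```python
# Minimal sketch: as the KDE bandwidth shrinks toward zero, the training
# log-likelihood grows without bound while the held-out log-likelihood
# collapses -- the usual picture people label overfitting.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
train = rng.normal(size=200)
test = rng.normal(size=200)

def kde_loglik(points, data, h):
    """Mean log-density of `points` under a Gaussian KDE built on `data`."""
    z = (points[:, None] - data[None, :]) / h            # (n_points, n_data)
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    return np.mean(logsumexp(log_kernel, axis=1) - np.log(len(data)))

for h in [1.0, 0.3, 0.1, 0.01, 0.001]:
    print(f"h={h:<6} train={kde_loglik(train, train, h):10.2f} "
          f"test={kde_loglik(test, train, h):10.2f}")
```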
Someone should really write a paper investigating whether all of the things you list are necessary to overcome overfitting.
https://arxiv.org/abs/1611.03530
I think the critique is more that overfitting is used as a normative concept standing in for the bad generalization properties of a method. The literature often suggests that if you avoid overfitting, your method will in turn (a priori) generalize well in all situations. But being nihilists about generalization theory (judging from previous posts, I assume Ben to be one too), we know that this is an illusion. Thus the connection from overfitting to generalization is severed. And that leaves us with what? Observing that you didn't like the performance of a method on certain data - as in your example.
Generalization nihilists (or, better, Humeans) unite!
https://simons.berkeley.edu/talks/max-raginsky-university-illinois-urbana-champaign-2024-09-12
IIRC, you didn't say overfitting once in your talk. :)
That’s because it doesn’t exist!
Thank you for plugging it here! Great talk!
Glad you liked it!
Thanks for linking your talk; you made many points in passing that have taken me a long time to figure out... I especially like how you point out that every inductive inference requires taking risks. I once stated in a manuscript that inductions require judgment calls and that these are hard - perhaps impossible - to encode in software. The reviewers were not impressed.
For an economist, what's obviously missing in this list is a theoretical rationale. Economists were way ahead of the curve in denouncing data mining (still a pejorative in the profession, unlike everywhere else). That partly reflected early experience with techniques like stepwise regression (at a time when other social scientists thought 2x2 ANOVA was cool). But mostly it was the view that unless your results fitted into a theoretical framework they were, at best, curiosities.
Of particular relevance to machine learning was the rejection of discriminant analysis (the basic tool of AI), even now, in favour of choice modelling. McFadden is the name to check here.
I cannot follow the argument that, because there might be many reasons why model predictions for a new dataset are "not so good", "conventional overfitting" therefore does not exist.
Could you please cite one statistician who says that preregistration is all you need?
Is there a published version of this blog post that can be cited?
> 1. An analysis works *too* well on one data set.
How much is too much? Like pornography, you'll know it when you see it.
In certain settings, empirical risk minimization fails to obtain minimax rates. Arguably, it would be natural to say that ERM is "overfitting" in these settings. That said, if overfitting is a property of a learning curve, then minimax is not the right notion, and we'd have to look at so-called universal rates, such as have been studied by Hanneke and others (https://openreview.net/forum?id=6cWDg9t3z5).
Do you have any useful resources that might help with understanding why these are terrible advice? Are there specific assumptions that statistical learning theory makes that do not apply to deep learning?
Can you recommend something similar to "Understanding deep learning requires rethinking generalization" [https://arxiv.org/pdf/1611.03530] that is more recent / addresses the problems stated?