> When the future turns out not to be like the past, machine learning can’t work!
Yes, but what about when the future *is* like the past, but performance is still bad? Maybe this should be the definition of overfitting. Great post.
This reminds me of another problematic definition: "transfer learning." An NSF grant on transfer learning funded my PhD studies, but I still haven't figured out what it means.
Same!
I think most of these general definitions fail to focus on why overfitting occurs--which imo is because an overly complex model tries to interpolate the noise. Ignoring confounding aspects (which could lead to false generalization in some sense), an example would be if the underlying data-generating process were a linear model with iid noise, but the fitted model were some DNN.
My issue is that this problem is solved by the holdout method. And no one does machine learning without a test set.
Also, if you use a test set, a DNN will fit a linear model with iid noise.
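To make that concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the sizes and names like `w_true` are illustrative choices, not anything from the post): generate data from a linear model with iid noise, fit an intentionally oversized MLP with a held-out validation split for early stopping, and compare its test error to a plain linear fit and to the noise floor.

```python
# Minimal sketch: data come from a linear model with iid Gaussian noise, and an
# oversized MLP is fit with a held-out validation split for early stopping. The
# point is to compare held-out error against the noise floor and a linear fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, d, noise_std = 2000, 10, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + noise_std * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(256, 256), early_stopping=True,
                   validation_fraction=0.2, max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)

def test_mse(model):
    return np.mean((model.predict(X_te) - y_te) ** 2)

print("noise floor (sigma^2):", noise_std ** 2)
print("linear test MSE:      ", test_mse(linear))
print("MLP test MSE:         ", test_mse(mlp))
```

With the holdout doing the model selection, the big model's test error typically lands near the noise variance rather than blowing up.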
Sure, those are means to solve/reduce overfitting for those scenarios, but that doesn't mean that definition of overfitting fails to capture those scenarios in the first place.
Defining what is overfitting and describing how to address it are separate tasks!
It feels like your primary problem is with the simplistic linguistics or "rules of thumb" that were probably designed to cater to even the bottom quintile of students/practitioners. But maybe this shows up as problematic dogma more in your surroundings than in mine. In my experience, most people know that poor generalization performance isn't obviously or directly due to one of 10 things.
What are the more common factors people tend to attribute poor generalization to in your experience?
In grad school I was constructing RBF kernels on 50,000 features for < 250 examples and there were always people shouting CuRSe oF DimEnSioNALiTy!!!!11! I just kind of ignored them. The kernels had eigengaps and decaying spectra and that was fine for me.
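For what it's worth, here is a rough sketch of the kind of sanity check being described (numpy only; the sizes mirror the comment, but the data-generating choices and the median-bandwidth heuristic are mine): 250 examples living in 50,000 ambient dimensions but with low intrinsic dimension, where the RBF kernel's spectrum decays quickly despite n << d.

```python
# Minimal sketch: 250 examples in 50,000 ambient dimensions with low intrinsic
# dimension. The RBF kernel matrix is built with the median-distance heuristic,
# and we inspect how fast its eigenvalues decay.
import numpy as np

rng = np.random.default_rng(0)
n, d, intrinsic = 250, 50_000, 5

Z = rng.normal(size=(n, intrinsic))              # low-dimensional latent structure
A = rng.normal(size=(intrinsic, d)) / np.sqrt(intrinsic)
X = Z @ A + 0.01 * rng.normal(size=(n, d))       # embedded in 50k dims with small noise

sq_norms = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
bandwidth2 = np.median(sq_dists[sq_dists > 0])   # median heuristic for the length scale
K = np.exp(-sq_dists / bandwidth2)

eigvals = np.linalg.eigvalsh(K)[::-1]            # descending order
print("top 10 eigenvalues:", np.round(eigvals[:10], 3))
print("fraction of trace in top 10:", eigvals[:10].sum() / eigvals.sum())
```

If the eigenvalues decay and there is a visible gap, the effective complexity is governed by the spectrum, not by the ambient feature count.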
The one-dimensional sinewave classifier (infinite VC dimension with one parameter!) is my favorite thought experiment for why counting parameters is pointless.
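For anyone who hasn't seen it, here is a small numerical version of that thought experiment (numpy; the specific points and the search grid are my choices): a brute-force search over the single parameter theta of h_theta(x) = sign(sin(theta * x)) realizes every labeling of the four points 10^-1, ..., 10^-4, and the same construction extends to any number of points, which is why the VC dimension is infinite.

```python
# Minimal sketch: the one-parameter classifier sign(sin(theta * x)) shatters the
# four points 1e-1, ..., 1e-4. We sweep theta over one period of the slowest
# sine and check that every one of the 2^4 labelings is realized.
import itertools
import numpy as np

x = np.array([1e-1, 1e-2, 1e-3, 1e-4])
thetas = np.arange(0.5, 2 * np.pi * 1e4, 1.0)      # covers a full period of sin(theta * 1e-4)
signs = np.sign(np.sin(np.outer(thetas, x)))       # shape (len(thetas), 4)

for labels in itertools.product([-1.0, 1.0], repeat=4):
    realizable = np.any(np.all(signs == np.array(labels), axis=1))
    print(labels, "realizable:", realizable)
```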
Since this became the topic of the day I decided to pull Tom Mitchell's book off my shelf to see if he has offered a definition. He does and it's mostly a formalized version of what Charles said:
"Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H, such that h has smaller error than h' on the training examples, but h' has smaller error than h over the entire distribution of instances."
So yeah, use a validation set to pick the best hypothesis.
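In symbols (notation mine, not Mitchell's), the definition says h overfits whenever some competitor h' looks strictly worse on the training sample S but strictly better on the instance distribution D:

```latex
h \text{ overfits the training data}
\iff
\exists\, h' \in H:\;
\widehat{\mathrm{err}}_S(h) < \widehat{\mathrm{err}}_S(h')
\;\text{ and }\;
\mathrm{err}_{\mathcal{D}}(h') < \mathrm{err}_{\mathcal{D}}(h)
```

Here err-hat_S is empirical error on the training sample S and err_D is error over the distribution D. Note both inequalities are strict, which is exactly what the comment below about h and h' having equal training error pokes at.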
I'll also add that the nice thing about working in robotics is that we always generate new test sets, i.e., whenever the robot actually does something, so I'm never really concerned with training for the test/validation set and the other worries folks bring up about trusting cross validation.
My issue with Mitchell's definition of overfitting is that you can have h and h' having *equal* train error and different test error.
I agree with you about robotics, of course. Do you do a lot of machine learning for robotics? If so, I'm curious what your evaluation practices are.
Yeah, most of my work is learning-based and usually coupled with other non-learning components. My preferred evaluation criterion is to have some clearly stated task-success test for which we can say either it succeeded or it didn't, e.g., the robot lifted the object and placed it on the desired shelf.
The difficult part is coming up with a meaningful set of conditions to show the desired generalization to things like novel object instances or environments not seen during training. It's something I'm interested in formalizing more. I think providing verification is one of the most important problems for future acceptance of robots with learned components.
I gave up on this fight some years ago. Nevertheless, good luck!
One of the last things Partha ever said to me was "though it is a wholly empirical field, machine learning is remarkably impervious to empiricism."
That sounds so much like Partha..
I mostly agree with you, but I still think that in very low signal-to-noise environments, specifically in trading, a lot of these still apply--particularly when you are predicting on a longer time horizon and thus have fewer samples. HFTs and other pods predicting at shorter time horizons are moving away from this convention, though; they're increasingly demonstrating success in applying deep learning to asset price prediction.
Every time I mention this topic, a common rebuttal is financial data. Unfortunately, I have no way of evaluating such assertions, as trading data is always private.
It's just gambling and rent extraction anyway. Pure waste of cognitive investment to worry about concerns brought up by folks whose application is predicting financial markets.
I love your blog, but I think this post is totally off base. Much of the success of deep learning has been discovering tricks (e.g., early stopping, dropout, SGD, convolutional and other forms of regularizing structure) to overcome overfitting. You might say "overfitting doesn't exist", but (I would argue) that's primarily because researchers have found ways to overcome it!
Consider a simple example: model a density with a collection of delta functions, one for each datapoint in the training set. Are you going to say that such a model *isn't* overfit? (It's a cop out to say "no one would ever do this". Of course they wouldn't — precisely because such a model would suffer from extreme overfitting!)
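If it helps, here is a minimal sketch of that example in the smoothed limit (numpy and scipy; the sample sizes and bandwidths are arbitrary illustrative choices): a Gaussian KDE with bandwidth h -> 0 approaches the one-delta-per-training-point density, and the training and held-out log-likelihoods pull apart accordingly.

```python
# Minimal sketch: as the KDE bandwidth shrinks toward zero, the training
# log-likelihood grows without bound while the held-out log-likelihood
# collapses -- the usual picture people label overfitting.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
train = rng.normal(size=200)
test = rng.normal(size=200)

def kde_loglik(points, data, h):
    """Mean log-density of `points` under a Gaussian KDE built on `data`."""
    z = (points[:, None] - data[None, :]) / h            # (n_points, n_data)
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    return np.mean(logsumexp(log_kernel, axis=1) - np.log(len(data)))

for h in [1.0, 0.3, 0.1, 0.01, 0.001]:
    print(f"h={h:<6} train={kde_loglik(train, train, h):10.2f} "
          f"test={kde_loglik(test, train, h):10.2f}")
```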
Someone should really write a paper investigating whether all of the things you list are necessary to overcome overfitting.
https://arxiv.org/abs/1611.03530
I think the critique is more that overfitting is used as a normative concept standing in for the bad generalization properties of a method. The literature often suggests that if you avoid overfitting, your method will in turn (a priori) generalize well in all situations. But being nihilists about generalization theory (judging from previous posts, I assume Ben to be one too), we know that this is an illusion. Thus the connection from overfitting to generalization is severed. And that leaves us with what? Observing that you didn't like the performance of a method on certain data - as in your example.
Generalization nihilists (or, better, Humeans) unite!
https://simons.berkeley.edu/talks/max-raginsky-university-illinois-urbana-champaign-2024-09-12
IIRC, you didn't say overfitting once in your talk. :)
That’s because it doesn’t exist!
Thank you for plugging it here! Great talk!
Glad you liked it!
Thanks for linking your talk; you made many points in passing that have taken me a long time to figure out... I especially like how you point out that every inductive inference requires taking risks. I once stated in a manuscript that inductions require judgment calls and that these are hard - perhaps impossible - to encode in software. The reviewers were not impressed.
For an economist, what's obviously missing in this list is a theoretical rationale. Economists were way ahead of the curve in denouncing data mining (still a pejorative in the profession, unlike everywhere else). That partly reflected early experience with techniques like stepwise regression (at a time when other social scientists thought 2x2 ANOVA was cool). But mostly it was the view that unless your results fitted into a theoretical framework they were, at best, curiosities.
Of particular relevance to machine learning was the rejection of discriminant analysis (the basic tool of AI), even now, in favour of choice modelling. McFadden is the name to check here.
I cannot follow the argument that, because there might be many reasons why model predictions for a new dataset are "not so good", "conventional overfitting" therefore does not exist.
Could you please cite one statistician who says that preregistration is all you need?
Is there a published version of this blog post that can be cited?
> 1. An analysis works *too* well on one data set.
How much is too much? Like pornography, you'll know it when you see it.
In certain settings, empirical risk minimization fails to obtain minimax rates. Arguably, it would be natural to say that ERM is "overfitting" in these settings. That said, if overfitting is a property of a learning curve, then minimax is not the right notion, and we'd have to look at so-called universal rates, such as have been studied by Hanneke and others (https://openreview.net/forum?id=6cWDg9t3z5).
Do you have any useful resources that might help with understanding why these are terrible advice? Are there specific assumptions that statistical learning theory makes that do not apply to deep learning?
Can you recommend something similar to "Understanding deep learning requires rethinking generalization" [https://arxiv.org/pdf/1611.03530] that is more recent / addresses the problems stated?