18 Comments

I was an early partisan of the fundamental importance of the test set! In this 2005 NIPS paper with Lana on estimating the intrinsic dimension (our first paper together!), we had to use test-set estimates in order to correct for the negative bias of ERM, which led to deflated estimates in earlier work: https://proceedings.neurips.cc/paper_files/paper/2005/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html.
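
(Side note for readers who haven't seen the generic version of that bias: here is a minimal sketch, using synthetic data and scikit-learn rather than anything from the paper, of how the error an ERM fit reports on its own training data is optimistically biased, while a held-out test set gives an honest estimate.)

```python
# Minimal sketch (not the paper's method): the training-set error of an ERM fit
# is optimistically biased; a held-out test set corrects for that.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + rng.normal(size=200)  # linear signal plus unit-variance noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("train MSE (biased low):", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE (honest):     ", mean_squared_error(y_test, model.predict(X_test)))
```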

Also, estimating the intrinsic dimension! Does anybody care about manifold learning anymore? Back then it was a thing (if only a minor thing), but here's also a bittersweet tidbit: Partha Niyogi was chairing the session in which I gave the talk, and Sam Roweis asked the first question.

Still very sad we lost those two.

Methods like diffusion models are doing manifold learning without applying that label explicitly.

"manifold" but yes.

We have always thought of real data as "manifolds," even if the theoretical analyses were for actual manifolds.

Yes, totally agree! But for a while, people got a bit too obsessed with unrolling Swiss rolls and such.

Swiss rolls are delicious.

This is a fun manifold learning paper. https://arxiv.org/abs/1710.11379

I’ll actually be covering it in my Probabilistic Deep Learning class on Thursday as part of my second day on variational autoencoders.

You might like this paper I wrote with my former student Josh Hanson:

https://link.springer.com/article/10.1007/s00498-025-00410-2

Isn't one factor in the explanation the fact that "classic" learning theory was heavily biased toward analyzing "worst-case" scenarios? A randomized hold-out set is somewhat opposed to this approach (as it implicitly asks what happens if you made a misrepresentative / "bad" train-test split).

On the other hand, it will probably also be hard to find explicit discussion of the hold-out evaluation practice in the works that tried to develop alternative theories of generalization to overcome that limitation [I'm thinking of the Seung, Sompolinsky, and Tishby "Statistical mechanics of learning from examples" paper(s) and related ideas].
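
(To make the "bad split" worry concrete, here is a quick sketch, on a synthetic dataset with scikit-learn and purely for illustration, that repeats the random hold-out split many times and checks how far an unlucky split can land from the average.)

```python
# Rough sketch of the "misrepresentative split" worry: how much does the
# hold-out estimate move across random train-test splits? (Synthetic data,
# illustration only.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print(f"hold-out accuracy across splits: mean={np.mean(scores):.3f}, "
      f"worst={np.min(scores):.3f}, best={np.max(scores):.3f}")
```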

I don't have answers! Only questions. :)

I mean, it would be cool if you could just build machines without testing them, I suppose.

The obsession with uniform convergence is justified by the fact that learning theorists wanted methods that are guaranteed to work, and also by their love of worst-case results. And in the narrow case when the goal is formulated as competing with the best hypothesis in a fixed class, uniform convergence allows you to conclude that the best you can do (in the worst case) is ERM. In other words, even in those old days, it was hard to write papers about obvious things, like using a test set (though, as I wrote elsewhere, test sets do make an appearance in some classic books; and I also recall a paper on "model selection and error estimation" by Bartlett, Boucheron, and Lugosi from 2002). If you add a little spice, e.g., perform cross-validation, or at least do model selection with infinitely many models (as in the above paper), theorists get way more excited. It's an interesting side effect of the culture we have that people often don't write papers about the powerful simple things. Though at least the (general-purpose) books should mention them.
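
(For concreteness, the "powerful simple thing" really is only a few lines. Here is a sketch, with a synthetic dataset and an arbitrary model family chosen just for illustration: fit each candidate on the training split and keep whichever one scores best on the held-out split.)

```python
# Sketch of the obvious thing: model selection by held-out score.
# Synthetic data and an arbitrary SVM family, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit one model per regularization setting, then score each on the held-out split.
candidates = {C: SVC(C=C).fit(X_train, y_train) for C in [0.01, 0.1, 1.0, 10.0]}
val_scores = {C: model.score(X_val, y_val) for C, model in candidates.items()}

best_C = max(val_scores, key=val_scores.get)
print("held-out scores:", val_scores)
print("selected C:", best_C)
```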

I have the feeling that learning theory is concerned with the higher reaches of the inductive hierarchy: the stuff that gives you a priori guarantees through uncomputable universal priors or similarly contrived inductive assumptions. The holdout method, on the other hand, seems content with an a posteriori statement of success. So perhaps the goals are just completely different.

I agree with your assessment of where we are now with learning theory, but I think that the initial goals in the 1960s were far more pragmatic. At least this is what Chervonenkis' recollections suggest.

Also, and this is just speculation on my part, I'm guessing they had to think about pragmatic justification given that they were developing these ideas in the Russian Institute for Control Sciences in the middle of the space race.

This post is great!

I'm trying to understand the question you are raising here better.

The papers you mentioned from Blum-Hardt and Dwork et al. seem to be mostly concerned with protecting your test set so that people do not adaptively overfit to it.

However, I feel the question you are asking is more in the direction of whether I can trust that a model generalizes well if it works well on the test set (say, when we don't know anything about the training or optimization process involved in constructing the model). Is this the main problem?

I think the plot suggests that if we use a test set too many times, we might "overfit," and then the model doesn't work as well on a new test set; but we do not know why the new-test vs. old-test plot looks like a line.

That's right! The thing I don't understand about my plot and the dozens of papers that have made similar plots is this:

Models that get higher scores on the public test set get higher scores on the private/replicated test set. And larger, more recent models tend to get the highest scores. Why?

But this gets back to your first point. I like the Blum-Hardt and Dwork et al. papers, but they are proposing modifications to the holdout method that don't seem necessary in practice.
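
(If anyone wants to see the shape of the plot I keep referring to, here is a bare-bones sketch with made-up accuracy numbers; in the real replication studies, each point is a published model evaluated on both the public test set and the newly collected one.)

```python
# Bare-bones version of the public-vs-new test plot. The scores below are
# made up to mimic the reported pattern (a roughly linear relationship,
# slightly below the diagonal); real versions use actual models.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
public = np.sort(rng.uniform(0.80, 0.97, size=30))      # hypothetical public-test accuracies
new = public - 0.04 + rng.normal(0.0, 0.01, size=30)    # hypothetical new-test accuracies

plt.scatter(public, new, label="models (made-up scores)")
plt.plot([0.78, 1.0], [0.78, 1.0], linestyle="--", label="y = x")
plt.xlabel("accuracy on public test set")
plt.ylabel("accuracy on new test set")
plt.legend()
plt.show()
```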

Just a thanks here. I'm a PhD statistician who worked in comp bio for a while and then in consumer-facing ML for a decade, and I'm now transitioning into AI safety. The blogs from the kids really help me stay informed about developments in the AI literature in niches I don't keep on top of as much. But your blog is consistently the most informative and thought-provoking!

Thank you!
