The Data Winter

Oct 9, 2023

When did machine learning finally realize it needed data to thrive?

7 Comments

Oct 9, 2023

Inspired by all your digging, I started looking into the history of cross-validation. Via Cosma Shalizi's excellent notes (https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/), I discovered Stone (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions". It contains a brief discussion of pre-1959 sample-splitting work (from the 30s, 40s, and 50s), including a series of papers published in a 1951 "symposium" on "The need and means of cross-validation". I'd love to hear your thoughts on this line of historical work, as it relates to Highleyman’s contributions!

Ben Recht

Oct 9, 2023

I'm on it! I will report back with what I find.

Erik

Oct 11, 2023

Platt scaling always seemed like some kind of duct tape (in a respectful way).

Ben Recht

Oct 11, 2023

John Platt is an endless source of clever and innovative ideas.

Mario Figueiredo

Oct 10, 2023

I have been thoroughly enjoying these recent posts of yours! They made me go back to an old favourite book from 1996, by Brian D. Ripley, "Pattern Recognition and Neural Networks". Unlike most ML books, but just like your PPA, it starts with a chapter on decision/prediction theory. Also unlike any other ML book I've read, Ripley gives credit to Highleyman: "The idea of a test set is sometimes called the hold-out method and goes back at least to Highleyman (1962)".

Ben Recht

Oct 11, 2023

Ripley's is such a great book. Very clear and very aligned with the old and new conceptions of pattern recognition.

Davis Yoshida

Oct 10, 2023

> There was tremendous excitement in the air, even if we were all deeply confused. Everyone was defining their own problem and pulled in different directions.

This could easily be describing the current LLM moment as well!

arg min
