3 Comments
David Khoo:

The funny thing is that benchmarks play almost no part in the actual industrial practice of machine learning. Basic MLOps best practices dictate that your model should constantly be training and testing on new data, not a static benchmark. February's data is the training set for the March test set. And March's data becomes the new training set for the April test set. And so your machine learning system trundles forward into the future, always renewing itself, without a static benchmark in sight.
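The rolling scheme described here can be sketched in a few lines. This is a minimal illustration of the idea, not any particular MLOps tool; the function name and month labels are made up for the example.

```python
def rolling_splits(months):
    """Pair each period's data (train) with the next period's (test)."""
    return [(train, test) for train, test in zip(months, months[1:])]

# February's data trains for the March test; March's trains for April.
splits = rolling_splits(["February", "March", "April"])
# → [("February", "March"), ("March", "April")]
```

Each window is retired as soon as the next month arrives, so no split ever hardens into a static benchmark.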

Or almost. You might keep a static benchmark for unit or integration tests, just to check that the code runs and produces sane results. And that's really where I think benchmarks should be used, by analogy with real-life benchmarks etched into work surfaces. They are tools for internal calibration, to check that your code is working and your model is improving. Internal reuse is fine, since you can't fool yourself for long. But they shouldn't be tools for external comparison between different models, because it's sometimes too easy and too profitable to fool others, especially if your models are closed.

Yes, it can work if everyone is completely honest with themselves and each other. There was a golden age of open code and open models when that was true, and that was a credit to the field. But that age is closing, and people are paying less and less attention to benchmarks because good performance on them means so little now. Every machine learning paper boasts SOTA results on all the latest benchmarks, but nearly none of them actually have real world value. Every new LLM busts every popular benchmark, yet they don't actually seem to be getting more useful, just more impressive. Something has gone wrong over the last few years.

Ben Recht:

You make many excellent points here, but there are many threads to untangle. I'll discuss the validity of benchmarks on Thursday. Not all benchmarks are good. Not all benchmarks last forever. But just because some benchmarks become stale and some persist does not mean that people aren't constantly creating benchmarks and using them to guide design.

Mark Johnson:

Is there any good advice on how large the training and test sets need to be, and how exactly they should be split? How do dev sets fit into this?

The Penn Treebank that many of us working on syntactic parsing in NLP used had 24 sections, which corresponded to about two weeks of Wall Street Journal content in chronological order. Sections 0 and 1 were usually ignored, because the theory was that the annotators were just warming up, and so we thought those sections might contain more errors. The main training set was sections 2-21, and the test set was section 23, which you were not supposed to look at and were only meant to use once each time you wrote a paper (I suspect this wasn't followed in practice). The dev sets were sections 22 and 24 (yes, we had two back then). Section 24 was generally regarded as a bit flaky because the annotation project was winding down (section 24 was only about 2/3 the size of the other sections, if I remember correctly), and the source time period was getting close to Xmas, which of course meant there was unusual text there.
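The section assignments described above can be written down as a simple table. The dictionary name and the zero-padded section labels are just illustrative conventions, not an official artifact of the Treebank distribution:

```python
# WSJ Penn Treebank split as described in this comment.
PTB_SPLIT = {
    "ignored": ["00", "01"],                         # annotators warming up
    "train":   [f"{s:02d}" for s in range(2, 22)],   # sections 2-21
    "dev":     ["22", "24"],                         # 24 was smaller/flakier
    "test":    ["23"],                               # look at it sparingly
}
```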

There was substantial semantic drift in the Wall Street Journal over the time that the Penn Treebank was collected, so I always liked the idea that the test set (section 23) was temporally separated from the training data.
