2 Comments
David Khoo

The funny thing is that benchmarks play almost no part in the actual industrial practice of machine learning. Basic MLOps best practices dictate that your model should constantly be training and testing on new data, not a static benchmark. February's data is the training set for the March test set. And March's data becomes the new training set for the April test set. And so your machine learning system trundles forward into the future, always renewing itself, without a static benchmark in sight.
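Concretely, that walk-forward loop can be as simple as the sketch below. The synthetic monthly data, the logistic-regression model, and the month names are illustrative stand-ins for a real pipeline, not any particular production setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_month(drift):
    """Synthetic stand-in for one month of (features, labels), with drift over time."""
    X = rng.normal(size=(500, 5))
    w = np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + drift
    y = (X @ w + 0.1 * rng.normal(size=500) > 0).astype(int)
    return X, y

# Each month's data is the training set for the next month's test set.
months = {name: make_month(drift=0.1 * i)
          for i, name in enumerate(["Feb", "Mar", "Apr", "May"])}

names = list(months)
for train_name, test_name in zip(names, names[1:]):
    X_tr, y_tr = months[train_name]
    X_te, y_te = months[test_name]  # the following month becomes the test set
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"train on {train_name}, test on {test_name}: accuracy {acc:.3f}")
```

No static benchmark appears anywhere: the evaluation set keeps rolling forward with the data.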

Or almost. You might keep a static benchmark for unit or integration tests, just to check that the code runs and produces sane results. And that's really where I think benchmarks should be used, by analogy with the real-life bench marks etched into work surfaces. They are tools for internal calibration, to check that your code is working and your model is improving. Internal reuse is fine, since you can't fool yourself for long. But they shouldn't be tools for external comparison between different models, because it's sometimes too easy and profitable to fool others, especially if your models are closed.
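That "internal calibration" use can be made concrete as an ordinary regression test: a small frozen dataset plus a loose threshold, run alongside the rest of the test suite. The data, model, and threshold below are hypothetical placeholders, just to show the shape of the check:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def test_model_clears_frozen_benchmark():
    # Frozen "bench mark": fixed seed, never updated, checked in with the code.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] - X[:, 1] > 0).astype(int)

    model = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
    acc = accuracy_score(y[150:], model.predict(X[150:]))

    # Loose bar: the point is to catch broken code, not to compare models.
    assert acc > 0.8, f"model regressed on the frozen benchmark ({acc:.2f})"
```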

Yes, it can work if everyone is completely honest with themselves and each other. There was a golden age of open code and open models when that was true, and that was a credit to the field. But that age is closing, and people are paying less and less attention to benchmarks because good performance on them means so little now. Every machine learning paper boasts SOTA results on all the latest benchmarks, but nearly none of them actually have real-world value. Every new LLM busts every popular benchmark, yet the models don't actually seem to be getting more useful, just more impressive. Something has gone wrong over the last few years.

Ben Recht

You make many excellent points here, but there are several threads to untangle. I'll discuss the validity of benchmarks on Thursday. Not all benchmarks are good. Not all benchmarks last forever. But just because some benchmarks become stale while others persist doesn't mean that people aren't constantly creating new benchmarks and using them to guide design.
