20 Comments
Matt's avatar

Love this framing!

Visar Berisha's avatar

One thing I've done in a speech ML class is cross-list it between Speech & Hearing and Engineering. It brings together students with clinical/social speech expertise and those with technical speech expertise. It's group and project based. I've tried to design each project so that both play a role - e.g. collect the right kind of speech data, design the right kind of feature extractors, interpret the output of the model in context, etc.

Plenty of new challenges emerge, and it's really challenging to teach, but I've found that teaching ML without anchoring it in a domain-relevant problem isn't all that useful in practice.

Ryan S's avatar

Regarding homework: could be time to take a cue from sociology and have CS grad students... write papers *gasp*

Ben Recht's avatar

One thing I did this semester was add a question about their course project to every problem set. I'm going to lean more heavily on this next time I teach the class.

Alexandre Passos's avatar

I think there is interesting phenomenological (not ontological) math you can justify well in ML these days. Things like "assuming you want your neural network's activations or gradients to be invariant to the number of layers, this is how you initialize / normalize / etc.," or "these are useful power-law models of how neural networks learn," or "neural networks are obviously not quadratics, but if you squint and pretend that they are, you can predict a lot of the curves seen during training." I think this vibes very well with your argument that generalization is an axiom, and with the general vibe that ML made not much progress while it tried to treat things as math, but only got unblocked when it started treating them as physics.
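The first example ("invariant to the number of layers") can be sketched numerically. This is a minimal illustration, not anyone's actual recipe; the width, depth, and the two scalings below are arbitrary choices made up for the demo:

```python
# Sketch: if you want activation magnitudes to be roughly invariant to
# depth, scale each layer's Gaussian weights by sqrt(2/fan_in) (the
# "He" scaling for ReLU). A smaller, naively chosen scale makes the
# activations vanish as depth grows.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 50

def forward_rms(scale):
    """Push a random input through `depth` linear+ReLU layers and
    record the RMS activation magnitude at each layer."""
    x = rng.standard_normal(width)
    rms = []
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width))
        x = np.maximum(W @ x, 0.0)  # ReLU
        rms.append(np.sqrt(np.mean(x ** 2)))
    return rms

naive = forward_rms(scale=1.0 / width)        # activations collapse with depth
he = forward_rms(scale=np.sqrt(2.0 / width))  # magnitudes stay roughly constant

print(naive[-1], he[-1])
```

The point is exactly the phenomenological move described above: nothing here proves anything about learning, but the invariance requirement pins down the initialization scale.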

Ben Recht's avatar

Hmm. I'm not sure I buy that the phenomenology is math or physics. Machine learning is an engineering discipline. The phenomenology is a convenient way to plot best practice, but I don't think it provides any fundamental laws. For example, no one has come up with a reasonable explanation of these silly scatter plots yet, but they are very helpful to inform practice: https://arxiv.org/abs/1902.10811

Anna Gilbert's avatar

I certainly agree that benchmarking is the main approach to “proving” a method works, and that checking the leaderboard can tell you which method or architecture makes the most sense for a range of benchmark datasets and tasks. But the “Kagglification” of research has some real pitfalls. Are we sure people have established the baselines thoroughly and completely? Can we be sure of the leaderboard results when we can’t reproduce them ourselves (it’s really hard to re-run someone else’s experiments)? How do we agree, as a community, on the benchmark datasets and tasks? How do we know that said benchmark datasets are representative of the type of task we really want to run?

In other words, it’s a good model and it could do with some best practices from scientific experiments. I’ve also bandied about the idea of a “model organism” for ML as there are for various biological processes.

Ben Recht's avatar

Yes, absolutely. I'm not arguing to Kagglify all of research, but *machine learning* is inseparable from benchmarking. And though I understand the unease in your questions, I think the evidence is quite compelling that most of the interesting results in AI can be traced back to this culture of benchmarking.

That said, I am by no means a machine learning imperialist. It's a useful engineering technology with undeniably impressive applications. But I'm worried its success has convinced too many people that it is a panacea for all scientific advancement. I don't subscribe to that view!

Anna Gilbert's avatar

Totally agree with you! It’s not a panacea for scientific advancement. And the really big innovations have come about because of benchmarking. Maybe my point can be summarized as: do the benchmarking really, really carefully and well to show a truly big advance, and remember that not every medium-sized idea is a true advance :)

Ludwig Schmidt's avatar

Just a drive-by comment here as I'm scrolling through some of Ben's posts over the holidays: I agree with everything that has been said, thanks for the great discussion! I just want to add that many medium-sized ideas can stack up to give you performance gains that are as large as a truly big advance in machine learning. Benchmarks are good at measuring how the medium and big advances compare. I think combining many medium-sized ideas, as long as they stack, is good engineering practice and worth mentioning / teaching in the context of benchmarking.

Manjari Narayan's avatar

> Of course, this is frustrating as all hell to computer science students who are taught that you just type incantations into Jupyter notebooks and conjure up pure logical positivism. Machine learning has a recipe book, but, unlike when I’m teaching an undergraduate course in algorithms, I can’t justify much of it at all.

I've been wanting to articulate that there is an old-school empiricism that has taken over ML, but then felt too cowardly to put it out there.

Two things:

a) Do all the results about benchmarks and competitive testing hold for biological problems, which have far greater external-validity problems than, say, computer vision? Most biological competitions I've been a part of have shown a huge difference between the public and private leaderboards, especially when the private leaderboard contains additional out-of-distribution examples.

b) Is the capacity for the scientist/analyst to fool themselves so high in practical ML that benchmark studies and leaderboards actually provide a fail-safe against it? Statistical theory does not really take that into account very much.

Ben Recht's avatar

(a) Do you consider CASP a benchmark in biology? I do! https://predictioncenter.org/

(b) I do think that benchmarks and leaderboards provide an interesting failsafe. They are not a panacea, of course.

Manjari Narayan's avatar

Yes, that is a benchmark. But CASP was an unusually well-posed problem that is not representative of all that remains to be done. We've made progress on predicting structure and developing genetic markers, but everything related to biological function, which is incredibly context dependent in unmeasured ways, has proven harder.

Measurements of structure are not as messy as measurements of function. The results are very different for problems within animal and human data, even organoids. Though, to your point, benchmarking in these domains would still help expose how poor the predictive models being published in science papers actually are. But only to the degree that the evaluations are done right.

For example, this DREAM competition for oncology biomarkers didn't realize their evaluation strategy did not line up with the actual use-case of differentiating treatment effects for patients. And we now know real world performance of anti-PDL1 as a biomarker is abysmal.

https://www.synapse.org/Synapse:syn18404605/wiki/589611

The fundamental problem of differential treatment predictive biomarker work is that one can never observe the same patient, in the exact same circumstances, receive two different treatments. In cancer in particular, there are so many challenges. For instance, patients will be switched to treatments that have a chance of working compared to a prior treatment, so there are all kinds of systematic differences in why some patients are measured or unmeasured in different treatment arms. Throwing the clinical trial data from each treatment arm into a separate machine learning model was never going to produce a biomarker with the desired performance characteristics. Yet many competitions like the DREAM challenge did exactly this.

There is a lot going on in this example however. The very concept of a responder is a causal construct that is deeply unappreciated and distorted within every NIH institute.
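The treatment-switching problem above can be sketched in a toy simulation. All numbers here are invented purely for illustration, and "modeling each arm separately" is reduced to its simplest form, comparing arm averages:

```python
# Toy example: when sicker patients are preferentially switched to
# treatment B, comparing outcomes across arms (or modeling each arm
# separately on observed data) does not recover the true treatment
# effect, because each patient's counterfactual outcome is never seen.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

severity = rng.standard_normal(n)               # unmeasured prognosis
# Sicker patients (higher severity) are more likely to receive B.
p_b = 1.0 / (1.0 + np.exp(-2.0 * severity))
treated_b = rng.random(n) < p_b

# Potential outcomes: treatment B truly helps every patient by +1.0.
y_a = -severity + 0.1 * rng.standard_normal(n)
y_b = y_a + 1.0
y_obs = np.where(treated_b, y_b, y_a)           # only one outcome is observed

naive_effect = y_obs[treated_b].mean() - y_obs[~treated_b].mean()
true_effect = (y_b - y_a).mean()
print(naive_effect, true_effect)  # naive estimate is biased (even wrong-signed)
```

In this fake setup the true effect is +1 for everyone, yet the naive arm comparison comes out negative, because arm membership encodes prognosis. No leaderboard metric computed on the observed outcomes alone can detect this.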

The AI Architect's avatar

Spot-on framing! The "generalization as axiom" lens nails why ML pedagogy feels so awkward. What really stands out, though, is how this organic pull-request evolution accidentally mirrors biological adaptation: vestigial structures, local fitness maxima, and selection pressure from leaderboards instead of environments. Maybe the technical debt isn't a bug but the actual substrate.

Maxim Raginsky's avatar

There are certainly parallels to biological evolution; I wrote about this a while ago: https://realizable.substack.com/p/verum-et-factum-convertuntur-again

Joe Jordan's avatar

My claim would be that the computer program that classifies images is fairly straightforward to write, and already exists in nuce on every device in the world: the JPEG compression algorithm. If you do PCA on a bunch of images, you get a set of filters (the eigenvectors) and their weights (the eigenvalues). The JPEG algorithm has different weights than a neural net trained for classification, but both work by turning an image into a linear combination of filters. This is also basically what attention heads are doing, but in a higher-dimensional space.

Kevin M's avatar

What about questions that are more reflective? Maybe you derive a proof to convince yourself it works this way, and then have them reflect on it from a more humanities-style perspective?

Ben Recht's avatar

Mathematical theory is a convenient language for formalization. The problem arises when your theory is too fictional to guide practical considerations.

Adam Ginensky's avatar

You write: "Despite statistical arguments declaring it fundamentally flawed, the culture of competitive testing on benchmarks has driven and still drives the engine of what the field defines as progress." Can you justify this? For example, I was under the impression that various CV methods were asymptotically equivalent to AIC and BIC.

In general, I think that ML is driven by 'approximate methods' in the sense that we observe data we know is noisy and therefore can't derive exact answers. Things like PAC are the best we can do. I think this makes it different from most other applications of mathematics to science.

Ben Recht's avatar

PAC Learning makes poor predictions and gives bad advice. I don't think it's the best we can do. https://www.argmin.net/p/thou-shalt-not-overfit