Benchmark Studies
It is impossible to disentangle technical innovation from technical debt
This is a live blog of the final lecture of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A full Table of Contents is here. I tried to summarize my semester reflections in class on Thursday, but found my thoughts haven’t quite settled yet. I’m hoping a week of posting will help me sort them out.
My GSIs (one in particular) gave me a lot of guff about how the homework was too easy this semester. My response is that the theoretical foundations of the course are still too weak for me to pose hard math questions in good faith. Yes, there are difficult mathematical problems in finding the optimal constants for certain learning-theoretic bounds. But finding constants for theories that give bad advice in practice is a waste of everyone’s time. In machine learning, we learn more from social analysis than functional analysis, and it’s hard to write problem sets on sociology.
The theory in machine learning is frustratingly less mathematical than that of other fields of information engineering. For example, consider mathematical optimization, which I taught last fall. There, you begin with a mathematical modeling language. If you believe that you have a problem that can be written in this language, optimization provides algorithms to find the solution. If you can’t write your problem in that language, you’ll need to try something else.
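Concretely, the workflow looks something like this. (A minimal sketch: cvxpy is one such modeling language, and the least-squares problem below is a made-up example, not anything from the lecture.)

```python
import cvxpy as cp
import numpy as np

# A made-up constrained least-squares problem, stated in the
# modeling language. The numbers are arbitrary.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

x = cp.Variable(2)                                  # the decision variable
objective = cp.Minimize(cp.sum_squares(A @ x - b))  # what to minimize
constraints = [x >= 0]                              # what must hold

problem = cp.Problem(objective, constraints)
problem.solve()                                     # the solver does the rest
print("optimal x:", x.value)
```

The division of labor is the point: you state the problem in the language, and the algorithm finds the solution.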
You might think machine learning works that way too. If we believe we have a population in which a certain variable is well approximated by a parametric function of the other variables, then sure, we can look into algorithms to estimate that function via random sampling from the population. However, we rarely, if ever, believe this parametric assumption in applied machine learning. This is where we become untethered from the mathematics. In almost every problem that people care about, we simply don’t know the functional relationship connecting these quantities. If we did, we wouldn’t be using machine learning in the first place.
I like to illustrate this with two extreme cases:
I believe that there is a computer program that can tell whether a string of bits has an even or odd number of ones. I can write this program in a single line of Python (a sketch follows below). I don’t use machine learning.
I believe there is a computer program that can identify animals in images. I have no idea how to write that computer program. I turn to machine learning.
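For the record, here is the first program (the one-liner is the lambda; the extra lines are just a sanity check):

```python
# Parity of a bit string: 0 if the number of ones is even, 1 if odd.
parity = lambda bits: bits.count("1") % 2

assert parity("1011") == 1  # three ones: odd
assert parity("0110") == 0  # two ones: even
```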
In machine learning, the price of admission is a belief that pattern recognition is possible and the conviction that it’s too hard to write a function describing these patterns from first principles. Generalization is an axiom, not a theorem. It would be nice if we could say one algorithm is better than another at finding these prediction functions, but we don’t have theory for that. Instead, we have to look to engineering “best practice” for what to do next.
Am I saying that every machine learning course has to have a few weeks on science studies? Yes I am.
Of course, this is frustrating as all hell to computer science students who are taught that you just type incantations into Jupyter notebooks and conjure up pure logical positivism. Machine learning has a recipe book, but, unlike when I’m teaching an undergraduate course in algorithms, I can’t justify much of it at all.
I know that I can approximate arbitrary nonlinear patterns as a composition of simple, componentwise nonlinear functions and linear maps (what we call neural networks). I know that I can arrange data streams into arrays that respect certain local consistency properties, thinking of text as sequences of tokens and images as spatial arrays of patches. I can compose these basic primitives to construct a bunch of different candidate prediction functions. I can use numerical search to find the linear maps in the resulting functional expression.
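Here’s a minimal sketch of that recipe in NumPy. Everything specific, the toy data, the layer widths, the step size, is an illustrative assumption; the point is the shape of the construction: two linear maps, a componentwise nonlinearity between them, and gradient descent as the numerical search.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))             # toy inputs
y = np.sin(X.sum(axis=1, keepdims=True))  # an "unknown" pattern to fit

W1 = rng.normal(size=(4, 32)) * 0.1       # linear map 1
W2 = rng.normal(size=(32, 1)) * 0.1       # linear map 2
lr = 0.1

for _ in range(2000):
    H = np.tanh(X @ W1)                   # componentwise nonlinearity
    pred = H @ W2                         # composition of the primitives
    err = pred - y
    # numerical search: a gradient step on the squared error
    dW2 = H.T @ err / len(X)
    dH = (err @ W2.T) * (1 - H**2)        # backprop through tanh
    dW1 = X.T @ dH / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final mse:", float(np.mean(err**2)))
```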
This set of weakly specified building blocks begets a zoo of methods. But we can only tell stories for why we’d prefer one over another. Random Forests, CNNs, RNNs, and Transformers are all potentially useful to get your number to go up, but they aren’t fundamental in any sense. There are some architectures that people get excited about for a while, and they yell about how awesome they are. Then some new architecture becomes exciting, and they yell about that. People build the next cool thing on the last cool thing. They tend not to emphasize how data is fundamental, and how it’s essential to set up your ETL pipeline in precisely the right way. But those pesky details are easy enough to find if you look through the various open-source machine learning repositories. And so machine learning continues, one pull request at a time.
This organic process is fine! But I don’t think you can explain anything about it with large deviation inequalities or functional analysis. How can I know which method is best? I check my answer on some leaderboard.
I’ve been trying to figure out how best to teach this in context. Our machine learning practices make it impossible to disentangle technical innovation from technical debt. I don’t want to prove theorems about some widget that is currently widely deployed because maybe it won’t be used next week. Some components are vestigial structures left behind after a frenzied series of paper deadlines and investor calls. Which structures? I can’t tell you.
On the other hand, machine learning has some shockingly robust practices that other fields should emulate. The train-test paradigm is fascinating. Despite statistical arguments declaring it fundamentally flawed, the culture of competitive testing on benchmarks has driven and still drives the engine of what the field defines as progress. We still can’t explain much about why it works as well as it does. We don’t have compelling theories for why benchmarks don’t go stale as they saturate, or why we see the patterns we see in empirical performance on these benchmarks. The social patterns here are fascinating, and they should be taught more explicitly in machine learning courses.
Although some argue that we need to move beyond the benchmarking paradigm, I would counter that the benchmarking paradigm defines the field. Believe that pattern recognition is possible. Specify your metric at the population level. Gather two samples representative of this population and use one for play and one for benchmarking, trying to maximize your metric. Once you get bored with the benchmark, make a new one. That’s machine learning in a nutshell. In practice, machine learning sociology is all we need.
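The whole paradigm fits in a dozen lines. A minimal sketch, with synthetic data and a least-squares model standing in for whatever you’d actually train:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # a stand-in population metric target

# gather two samples: one for play, one for benchmarking
idx = rng.permutation(len(X))
train, test = idx[:800], idx[800:]

# "play" happens only on the training sample
w, *_ = np.linalg.lstsq(X[train], 2.0 * y[train] - 1.0, rcond=None)

# the benchmark: the metric on the held-out sample
pred = (X[test] @ w > 0).astype(int)
print("test accuracy:", float(np.mean(pred == y[test])))
```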


Regarding homework: could be time to take a cue from sociology and have CS grad students... write papers *gasp*
I think these days there is interesting phenomenological math, not ontological math, that you can justify well in ML. Things like "assuming you want your neural network's activations or gradients to be invariant to the number of layers, this is how you initialize / normalize / etc.," or "these are useful power law models of how neural networks learn," or "neural networks are obviously not quadratics, but if you squint and pretend that they are, you can predict a lot of the curves seen during training." Which I think vibes very well with your argument that generalization is an axiom, and with the general vibe that ML made little progress while it tried to treat things as math, and only got unblocked when it started treating things as physics.