To Measure Is to Know

Deb's initial reflections on our machine learning evaluation course.

Apr 25, 2025

Today’s post is by my co-instructor Deb Raji, in which she shares her reflections on our graduate class on machine learning. Deb and I both violently agreed and disagreed throughout the semester, and that’s part of what made the class so fun to teach. I’m excited to share her perspectives here on argmin. You can follow Deb on Twitter and Bluesky.

“Machine learning evaluation” is usually relegated to the last lecture of any Intro to Machine Learning course -- a longstanding afterthought of the field, unfairly pushed to the back corners of the minds of students, researchers and practitioners alike. On its surface, machine learning evaluation should be simple. We have expectations of what a model can do, and we should be able to straightforwardly measure what the model actually does -- quantifying the discrepancy should be as simple as running the model on a hold out set and comparing its predictions to ground truth.

Despite the imperfection of our first attempt, I like that our class immediately tried to open up the Pandora’s box of this topic and properly engage with its complexity. We didn’t run away or pretend to understand something that the field has been avoiding almost since its inception. And I believe we were rewarded for that. Within just a couple weeks of inquiry, it became immediately clear that machine learning evaluation was not as simple as we’d like to think.

The first breakdown happened with our foray into the formal assumptions of the hold out method, the central dogma of evaluating supervised learning models. Take an adequate sample of in-distribution data and keep it secret. Compare your model’s predictions on this data to some ground truth expectation of what the outcome or label should be -- the degree of overlap between the predictions and ground truth in your test set is your evaluation. This is the statistical rationale behind every benchmark, and we walked through the nice generalization claims that follow from that rationale. But then we started walking through actual examples -- wait, how many test sets are actually kept secret? What does it mean for a benchmark to be representative? How big are these samples - are they big enough for any of the generalization claims to even hold?
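To make the sample-size question concrete, here is a minimal Python sketch of the hold out estimate and of how much a test set of a given size can pin down. It is an illustration, not something from the course materials. The function names are hypothetical, and the bound is the standard Hoeffding inequality, which only applies if the test examples are i.i.d. draws from the distribution you care about and were never used to pick the model.

```python
import math


def holdout_accuracy(predictions, labels):
    """Fraction of held-out examples where the prediction matches the label."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def hoeffding_half_width(n, delta=0.05):
    """Half-width of a two-sided (1 - delta) interval around test accuracy.

    Follows from P(|acc_hat - acc| >= eps) <= 2 * exp(-2 * n * eps**2),
    assuming n i.i.d. test examples that played no role in model selection.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(math.log(2 / delta) / (2 * n))


# How wide is the guarantee for a few test-set sizes?
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_half_width(n), 4))
```

Under these assumptions, a 100-example test set only locates accuracy to within roughly 14 percentage points at 95% confidence, while 10,000 examples narrows that to about 1.4 points. A leaked or reused test set voids the guarantee entirely, whatever its size.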

Soon enough, we were learning new terminology to describe these assumption violations: benchmark bias, data contamination, statistical power. Some of the discrepancies were just strange signs of the times -- for instance, for large language model evaluation, we don’t even sample our test and train data from the same distribution anymore, violating a foundational assumption of the hold out method. Even when a training set is provided (as it is for benchmarks like GLUE, SuperGLUE and ARC-AGI), people train on whatever they want and then at most fine-tune on that training set to orient the task before evaluation on the test set. This wasn’t the case with past benchmarks like ImageNet, and this doesn’t align with the historical statistical assumption of benchmark evaluations for predictions. We keep holding on to the hold out method as if it means something, but there’s no reason we should expect the same kind of statistical guarantees with this new approach to evaluation -- it’s clear that for modern benchmarks, the old rules don’t apply.
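One crude way to look for data contamination is to check whether long word sequences from test items also appear in the training text. The sketch below is a hypothetical heuristic and not a method described in the class: the function names, the whitespace tokenization and the default n-gram length of 8 are all assumptions made for illustration, and exact matching like this will miss paraphrased or translated leaks.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, training_corpus, n=8):
    """Share of test examples sharing at least one n-gram with the training text.

    Returns (rate, flagged_examples). The threshold n is an arbitrary choice.
    """
    if not test_examples:
        raise ValueError("test_examples must be non-empty")
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(test_examples), flagged


# Toy usage with a short n so the overlap is visible.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["A quick brown fox appeared", "completely unrelated sentence here now"]
rate, flagged = contamination_rate(test, train, n=3)
print(rate, flagged)
```

A nonzero rate is a flag for closer inspection rather than proof of leakage, and a zero rate says little when the training corpus itself is not fully known.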

This led us very quickly down the garden path of alternative views on “benchmarking” and the new crop of efforts tentatively exploring what lay beyond the default benchmarking paradigm. Despite clear deviations from the statistical assumptions of the hold out method, we concluded that benchmarks are still good for something, and are, in many cases, still consistently reliable proxies for algorithm selection and ranking. They are also good at exposing functional blind spots, and regularly reveal model capabilities that surprise us. At the same time, though, they are clearly not enough -- it was fun to explore how Human-AI interaction lab studies, randomized experiments, compilable unit tests, online evaluations and other attempts in the field to explore outside the paradigm were providing another layer of insight into how these models behave, especially in deployment. If anything, I wish we had spent more time looking beyond the benchmarking paradigm and exploring what those alternative evaluation approaches had to offer.

Another major theme of the course that I wish we had more time to dig into was validity. It took us perhaps too long to properly convey to the class what validity and reliability were about and to break down the pretty unfamiliar concepts of internal, external and construct validity. These are concepts traditionally brought up in the context of experimental evaluations -- internal validity is about how much you can trust your experimental results; external validity is about the generalizability of those results; and construct validity is about how well you operationalized and designed the experiment relative to the real world problem you meant to evaluate. Mapping these concepts onto the machine learning context was crude and raised questions of its own -- is internal validity “all the stuff you need to do to impress reviewer 2”? Is external validity about generalization to another benchmark or generalization to a real world setting? How do we even begin to think about construct validity for a “general purpose” model? This discussion lasted for about three classes, and I don’t even think we came close to a satisfying resolution.

There were certainly parts of the course I’d revisit completely -- it took a couple of classes to make progress on a coherent discussion on uncertainty; the detour into scoring rules, calibration and forecasting felt a bit disconnected. But even with some of these kinks, we pretty much always had a productive discussion, in no small part due to the remarkable engagement of our students. When designing it, we had no idea who would want to take this class (I feared no one would!), but we were pleasantly surprised by the strong enthusiasm and a fairly long waiting list. In the end, we were able to compose a class that was purposefully disciplinarily diverse. It included people working on robotics, NLP, vision; in ML application areas that included energy, healthcare, chemistry, computational biology, economics and more. We even had a few folks coming in from statistics and CS theory. It was just a radically fun space to be in, and I appreciate their contributions to the class the most. On one hand it made teaching challenging -- in one class on uncertainty, I could tell all the engineering and robotics people were bored of the discussion on uncertainty propagation, while some of the class had never heard of it; meanwhile, that same group struggled to grok confidence intervals, a topic the statistics students had long mastered. It was clear within a few months that the prerequisite knowledge was unevenly distributed, and this undoubtedly made things more difficult. But on the other hand, we learned so much from the discipline-specific experiences and perspectives shared on the topic -- from the over-reliance on demos in robotics to how computer vision’s ImageNet shaped the field and the NLP view on vibe-based evaluation. I often felt that the students could teach themselves, and that things were most informative for all of us when Ben and I got out of the way and let the conversation linger.

I like how we ended the class, which was an acknowledgement of something that I don’t even know if we fully grasped when getting into this: how we evaluate machine learning has far-reaching consequences. It’s not just leverage in competitive testing and algorithm ranking in papers or marketing materials -- it’s also a barometer for market entry and procurement, with measurements used in litigation, policymaking, post-deployment monitoring, and audits of product safety. Throughout the class, it became clear that the severity of the problems we considered was a function of the stakes of measurement. When intense precision and validity were required, the measurement challenge grew exponentially.

By the time the course was winding down, I felt, like Ben, that we had more questions than answers. And why not? We had opened up Pandora’s box, we had shaken up the hornet’s nest, and these were the consequences. But ultimately, I’m grateful - I have long felt unsatisfied with the state of machine learning evaluation, and I now feel vindicated in my concern. Through the (often historical) readings and deep interdisciplinary discussions with the class, it became clear that I was far from alone. Almost since the beginning of the machine learning field, we have struggled with establishing a more principled approach to evaluation. Long before the chaos of large language models, we had felt this sore spot and ignored it. Now feels like a perfect time to dig in and finally figure things out.

A guest post by

Deb Raji (@rajiinio)

Interested in how AI lands in society. I also tweet @rajiinio. Finding an outlet for "writing that doesn't feel like homework".

Ormond

Not in this field. Do I detect a hint of alarm?

Raj Movva

Related to external validity, most benchmarks also suffer from poor ecological validity, i.e., is your testing environment at all related to the environment you ultimately care about. Which I'm now remembering you've already written about! :) https://ai.nejm.org/doi/abs/10.1056/AIe2401235

I wonder about these new benchmarks that test the ability to solve real tasks (like https://www.swebench.com/), but it feels like contamination is probably an issue.
