This semester, Deb Raji and I are teaching a graduate seminar on machine learning evaluation. The motivating question is, “What does it mean for machine learning to work?” The fun part is I have no idea.
Let me try to box in the scope of the semester. I’ll use machine learning as an umbrella term for statistical prediction: any system that makes predictions from examples is a machine learning system.
What about evaluation? An evaluation measures the difference between our articulated expectations of a system and its actual performance. Most of the time, especially in engineering, “articulate” means “quantify.” We specify measurements and metrics so that we can numerically calculate the deviation between expectation and performance.
The quantified metric seldom concerns a singular, well-specified situation. Engineering systems must perform well in a variety of instances of interest. Hence, we judge systems by their performance on average over a set of scenarios.
Once we’ve quantified the cases we care about and how we will score our performance on them, we’ve set a goal to be optimized. Everyone wants a system whose performance matches expectations!
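To make the averaging concrete, here’s a minimal sketch of evaluation as a score averaged over scenarios. The names evaluate, model, scenarios, and score are hypothetical stand-ins for illustration, not anything from the course.

```python
def evaluate(model, scenarios, score):
    """Average a chosen score over a collection of scenarios.

    Each scenario is an (inputs, expected) pair; `score` quantifies the
    deviation between the model's prediction and our expectation.
    """
    total = 0.0
    for inputs, expected in scenarios:
        total += score(model(inputs), expected)
    return total / len(scenarios)

# Example: a toy linear model judged by squared error over three scenarios.
scenarios = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
model = lambda x: 2.0 * x
squared_error = lambda prediction, expected: (prediction - expected) ** 2
print(evaluate(model, scenarios, squared_error))  # mean squared error over the scenarios
```

Once the cases and the score are fixed, this average is the number we optimize.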
To make clean stories, articulation becomes quantification. To handle diverse cases, quantification becomes statistical. To achieve expectations, design becomes optimization. This view of evaluation frames engineering and policy decisions as choosing the action most likely to maximize an average. It’s a short jump from lofty aspirations of holistic evaluation to crude statistical prediction. In other words, engineering evaluation is often machine learning evaluation in disguise.
This paradigm of engineering with statistical prediction is powerful. It’s the motivating problem in statistical signal processing and stochastic control. It’s at the core of rational choice theory in economics. Any time we make a decision based on a randomized controlled trial, we’re relying on an implicit prediction: that the cost-benefit tradeoff observed in the trial is representative of what will happen after the trial.
And so, the course hypothesizes that by understanding machine learning evaluation, we’ll gain some insight into evaluation more generally. We also hope to glean insights into why machine learning seems to be eating the world of engineering. Even though the semester focuses on machine learning, I’m after the broader epistemological question: What do engineers know, and how do they know it? This question was posed by Walter Vincenti in his book of the same title. Vincenti focused on aerospace; I’m interested in expanding the scope of his case studies by looking at information technology.
And there’s still plenty we don’t understand about evaluating machine learning. Why do static benchmarks drive progress in machine learning? Would similar benchmarks work in other fields? What makes a good benchmark? Is it just zeitgeist? Is there an alternative to competitive testing for evaluating predictions? Is it good enough to run some experiments in the lab and make tables showing that our new thing beats the state of the art? Is it good enough to film some cool demos of robots doing simple tasks (conveniently not showing the takes where the robot decided not to work)? Is it good enough to publish statistical metrics showing that our self-driving car fleet hasn’t killed anyone yet?
I have a lot of questions and few answers at this point. But hey, we have a whole semester to figure some of these things out.
Deb and I made an outline of the topics we want to cover, but my prediction intervals around this outline are very wide after week four. I expect a lot of changes. We’re going to cover the theory and practice of the holdout method. We’ll examine how it manifested itself in machine learning benchmarks and study how and why benchmarking drove progress. We’ll ask about the validity of such benchmarking and where it falls short. We’ll look at how forecasters evaluate predictions and dive into the subtle weirdness of scoring rules and calibration. We’ll look at how games function as machine learning benchmarks and how randomized experiments fit into this picture as well. And we’ll look at the heated debate over evaluating whatever these new AI monster models are doing. From this topic list, it’s clear the class will be a methodological mess. It will have some of the math we associate with statistical learning theory, but we’ll also have to engage with philosophy, history, and policy. Anyway, I’m excited about it.
I’m not going to commit to one of my “course live blogs” this semester for a couple of reasons. First, I have a bunch of other topics I’m trying to think through this semester. Second, I’m not sure every lecture/discussion in the class will be bloggable. However, I’ll periodically report in on how things are going. I’m curious to see whether I’ll have any more clarity about evaluation on May 1 than I do now. I suppose that means I need to articulate my expectations to properly benchmark myself.
Deb has her own motivating questions, and she may chime in here on the blog too from time to time.
Wonderful topic. I think I’ve reached a point where I would rarely recommend ML for anything with high stakes or high risk. This contrasts so much with my enthusiasm for ML when I started my PhD 15 years ago 😳.
I can’t believe you waited until AFTER I left Berkeley to teach this.