Machine Learning Evaluation - A Syllabus
What we read this semester, what worked, and what didn't.
I’ve received several requests over the semester for the syllabus from our class on Machine Learning Evaluation. There were two reasons I didn’t readily share it. First, I made the mistake of relying too heavily on Canvas. Using Canvas means that course content lives behind an obnoxious campus IT firewall. I’m not going to do this again. From now on, my course materials will be on public webpages, and I’ll use Canvas only for announcements and assignments. You’d think I’d have figured this out by this point in my academic career, but I suppose we are constantly honing our craft. This is just a note-to-self that I’m putting on the blog. Hold me to the commitment next semester, dear readers.
The second reason is perhaps a bit more justifiable: Deb and I were making up the course as we went along. This is the way a good grad seminar should go, right? We all figure out the boundaries of an intellectual field together. It felt a bit premature to be sharing the syllabus before we knew what the course was. We’re at the end of the semester, and I’m still not 100% sure I know what the course is. It usually takes me three iterations to solidify a class, and after that, I never want to teach it again. But after the first iteration of machine learning evaluation, I still feel very much at the beginning of figuring out how to teach this topic properly.
With these disclaimers, let me share what we ended up reading. Click the link below:
At some point, I will tidy up these citations to be neatly in APA format. I suppose I could ask ChatGPT to do that for me, no? But hopefully the links here all work, and there aren’t too many egregious citation typos. I’ve also added links to the relevant blog posts where I thought out loud about the class. These add some context about what we actually ended up discussing in seminar.
Even with this context, the syllabus doesn’t tell the entire story of how we worked through the class, but it’s a decent skeleton. Reviewing it this morning, I think much of this structure would remain the same if I were to teach the course again. I might move around a couple of the topics next time. Certainly, some of these topics were presented in a suboptimal order due to constraints in the instructors’ schedules. But I wouldn’t add any new topics. If anything, I’d expand the parts that felt a bit rushed and perhaps drop a few topics that seemed to add more heat than light.
Up through week 5 is a tight story of how machine learning and prediction systems have traditionally been evaluated. Week 6 on construct validity was also illuminating, and I’d expand this section to two weeks, adding a bit more philosophy of science. Specifically, I would spend more time on the problem of induction and on Lakatosian defense in science and engineering.
This brings us to the three weeks on uncertainty. In hindsight, Deb and I would reshuffle the material in Weeks 7-9, and we would certainly change the reading assignments. For example, I’d drop the survey of scoring rules by Gneiting and Raftery, which, while establishing interesting mathematical foundations, doesn’t give much insight into why we should use scoring rules in the first place.
That said, the uncertainty block was the part of the course where I learned the most. What is so interesting about uncertainty quantification is how the evaluation metric dictates the algorithm, both for making predictions and for reporting uncertainty: the metric itself tells you how to turn the data you have recorded so far into a forecast. My conclusion is that uncertainty-quantified prediction is nothing but defensive bookkeeping. I need to write this down in detail before teaching the course again. I’m currently writing up a paper on this viewpoint with Juanky Perdomo.
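To make that point concrete, here’s a minimal sketch. This is purely my own illustration, not a reading from the syllabus, and the data are made up: under the Brier score (a proper scoring rule), the forecast that minimizes average loss on the outcomes recorded so far is the empirical frequency, while under absolute error (not a proper scoring rule), the optimal report collapses to 0 or 1.

```python
import numpy as np

# Hypothetical binary outcomes "recorded so far".
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Candidate probability forecasts for the next outcome.
candidates = np.linspace(0, 1, 1001)

# Brier score (squared error): a proper scoring rule.
brier = np.array([np.mean((outcomes - p) ** 2) for p in candidates])

# Absolute error: not a proper scoring rule.
absolute = np.array([np.mean(np.abs(outcomes - p)) for p in candidates])

print("Empirical frequency:", outcomes.mean())                            # 0.7
print("Brier-optimal forecast:", candidates[brier.argmin()])              # 0.7
print("Absolute-error-optimal forecast:", candidates[absolute.argmin()])  # 1.0
```

That’s the sense in which the metric does the bookkeeping: pick a proper scoring rule and the recorded data tell you exactly what probability to report; pick an improper one and hedged forecasts get penalized.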
The most controversial week of the class was the one where we tried to discuss “contemporary” machine learning benchmarks. Sigh. I’d probably throw out all of this reading. The shelf life of “contemporary” evaluation in machine learning seems to be a week. One way of viewing this is that AI is too good now, and no benchmark is safe. Another view is that creating and evaluating benchmarks is challenging, and you can’t just throw together a random dataset and declare it a benchmark. The right perspective is probably in the middle.
However, it’s impossible to tell because the new papers on evaluation all have less content than a tweet. It’s a rough state of affairs! Even the two papers we assigned as mandatory reading don’t say much beyond “depending on how you format your axes, you can convince yourself of anything you want about LLMs.” This can win you a best paper award.
From a practical perspective, for “frontier models,” it’s unfortunately impossible to decouple their evaluation from the billions of dollars being thrown at them. We might have to be patient and wait for the craze to pass before we make sense of them. Perhaps we’ll be ready when I teach this class again in 2027.
The LLM unit at least motivated why machine learning needs to care about social program evaluation, and I thought the two weeks spent on RCTs and program evaluation were solid, though a bit rushed. It’s a lot to pack into a semester!
Finally, missing from this syllabus are readings from my forthcoming book, which I’m not at liberty to share widely just yet. The book will be published later this year, and I’ll add it to the syllabus once it’s available. I will make an official announcement soon, I promise!
The most positive signal from the class is that writing this post has convinced me that I want to teach it again. It’s always a good sign when I come away from a class with new research projects, and I have more than a few germinating. Evaluation is so critical to machine learning engineering, and yet our courses tend to just teach methods. What’s funny about the current state of affairs is that the atheoretical nature of machine learning means evaluation is all we have. Machine learning is most helpful when we don’t understand some phenomenon, but how can we evaluate outcomes without a firm causal theory of how entities are linked to each other? That’s the central question we grappled with this semester, and it’s one we’ll continue to grapple with moving forward.