Let’s loop back to where I started with bitter lesson blogging. The question is whether computer science has any content anymore. Let me remind you of Shital Shah’s initial tweet:
It's year 2018. You walk into Algorithms and Data Structures class. You tell students to just use whatever algorithm first comes to their mind, throw in a ton of compute and call it scaling. ‘That's a bitter lesson for you all’, you say, and leave the classroom.
I’m not teaching data structures, but I am teaching graduate machine learning again this fall. So this tweet hit hard. If the only design principle in machine learning is Scale-HODL-YOLO, then I should cancel my class and remove it from the course offerings. While I’m at it, I should lobby my department to give in and turn our graduate program into a business school that only teaches entrepreneurship classes.1
So the last few weeks of thinking out loud on argmin.net have been me trying to justify my pedagogical existence. Nearly a decade ago, Moritz Hardt and I redesigned the department’s graduate machine learning class to be a “second class” in machine learning.2 The department also offers a mezzanine class for grad students who have never taken a machine learning class before. Is there still a reason for a second class in machine learning? Is there even a reason for a first class in machine learning? There’s a high school track at NeurIPS now! Has the field been so simplified that all you need is to vibecode PyTorch with a chatbot?
I’ve pep-talked myself into thinking that the answer is no. Despite the rhetoric on Twitter, I think machine learning does have design principles beyond Scale-HODL-YOLO. I think these design principles are worth explaining and yield better machine learning systems. Many of these get skipped at the undergraduate level. Most of them are already in the book Moritz and I wrote, and I just need to fine-tune the presentation a bit.
Philosopher Paul Feyerabend is famous for his assertion that when it comes to the scientific method, anything goes. Certainly, there’s a degree to which this is even more true in engineering, as metrics often dictate abandoning rigid principles. And even more certainly, to any engineer who has ever looked at machine learning sausage making, machine learning appears to be the worst kind of Feyerabendian nightmare. I’ve argued here this week that when it comes to the guts of a prediction system, you really can do whatever you want. However, this sort of nihilism, which I think Sutton’s essay has only encouraged, overlooks the fact that there are a few very clear and powerful design principles in machine learning.
First, machine learning is inherently a field of optimization. It is not the clean optimization that you learn in an operations research class, where tight, simple mathematical formulas specify everything you need to know about a problem. Instead, machine learning is driven by engineering objective functions. We minimize classification errors. We maximize precision or recall. We find directions that maximally discriminate between two populations. We can even maximize the win-rate in games.
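To make “engineering objective functions” concrete, here is a minimal sketch of the most basic instance: choosing weights to minimize an average logistic loss on synthetic data by plain gradient descent. This is my illustration, not code from the class or the book, and every name and constant in it is arbitrary.

```python
import numpy as np

# Empirical risk minimization in miniature: pick weights w to minimize
# the average logistic loss over a fixed (synthetic) dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # 200 examples, 5 features
w_true = rng.normal(size=5)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))  # labels in {-1, +1}

def logistic_loss(w):
    # average of log(1 + exp(-y * <w, x>)) over the dataset
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w):
    margins = y * (X @ w)
    return X.T @ (-y / (1.0 + np.exp(margins))) / len(y)

w = np.zeros(5)
for _ in range(500):            # plain gradient descent, fixed step size
    w -= 0.5 * gradient(w)

print(f"final loss: {logistic_loss(w):.4f}")
print(f"training error rate: {np.mean(np.sign(X @ w) != y):.3f}")
```

Everything interesting happens in the choice of `logistic_loss`: swap in a different objective and you get a different learning method. That is the sense in which the field runs on engineered objectives.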
Artificial Intelligence sages propose lofty sci-fi nonsense like
“Build an agent to go out into the world and learn from scratch in any environment without any prior knowledge.”
We do not have any algorithms to do this. However, if you get a bunch of nerds together in a room, they might come up with a reformulation:
“Take a pretrained LLM and fine-tune it to produce samples that have the maximal chance of being accepted by a verifier, be it a test rubric or a compiler.”
This is something we can do. Both of these problems are called “reinforcement learning.” One is a fairy-tale pipe dream; the other is deployed by every large LLM company. We can only build “AI” through optimization.
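To make the contrast concrete, here is a toy sketch of the tractable formulation. The “policy” is a categorical distribution over eight bit strings and the “verifier” accepts even parity; both are hypothetical stand-ins for an LLM and a compiler or test rubric, and the update is textbook REINFORCE, not any lab’s actual recipe.

```python
import numpy as np

# Toy version of "fine-tune to maximize the chance a verifier accepts."
rng = np.random.default_rng(0)
logits = np.zeros(8)                 # policy parameters over 3-bit strings

def verifier(sample):
    # stand-in verifier: accept a string iff it has even parity
    return bin(sample).count("1") % 2 == 0

for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    sample = rng.choice(8, p=probs)              # sample from the policy
    reward = 1.0 if verifier(sample) else 0.0    # verifier as reward
    grad_log_pi = -probs                         # d/dlogits of log softmax
    grad_log_pi[sample] += 1.0
    logits += 0.1 * reward * grad_log_pi         # REINFORCE step

probs = np.exp(logits - logits.max())
probs /= probs.sum()
accepted_mass = sum(p for s, p in enumerate(probs) if verifier(s))
print(f"mass on accepted strings: {accepted_mass:.3f}")
```

Nothing here goes out into the world and learns from scratch. It just nudges a distribution toward samples a fixed checker accepts, and that reduction is what makes the problem solvable.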
Second, machine learning has a robust culture of evaluation that is too frequently chastised. I’d argue that evaluation is a strength of machine learning as a discipline. The train-test paradigm is a powerful methodology that lets engineers make incremental improvements on the metrics they want to optimize. Creating datasets that represent optimization problems is an art, but one with generalizable principles. There are lessons we can learn from examining practice about what makes a dataset a good benchmark.
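Here is the paradigm in its most minimal form, a sketch using scikit-learn on synthetic stand-in data (the dataset and the choice of model are placeholders, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1000)) > 0

# The train-test paradigm: fit on one split, report only the held-out score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```

Freeze the splits and the metric, and every proposed change to the system can be judged by whether that last number moves. That is the whole engine of incremental improvement.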
The goal of the class is thus to answer a tightly scoped set of questions: What are the design principles of machine learning? What are the fundamental optimization problems of machine learning? What are our general methods for solving these problems? What are our methods for evaluating performance? Which problems is this engineering framework ill-suited to solve?
Although I love Feyerabendian chaos, he’d be the first to admit that rules are necessary to build broad communities of expertise that can make interesting things. We should all know to ask first, “Have you tried logistic regression?” We should get a feel for when you really need to break out the transformer. Pure complexity and an obsession with “more is more” create an illusion of sorcery. We should understand the perils of technical debt that arise from excessive complexity.
And while I’m not sure I’ll be successful, I hope this motivates students to further probe some of the existing “anything goes” guts of machine learning systems for a cleaner understanding of our available software pipelines. I think at least a semblance of tighter principles will lead to better machine learning.
The challenge of the class is to avoid being blinded by survivorship bias. It is not clear that the blind nihilism of leaderboard chasing is an operational optimum. Let’s understand what it’s good for and how to do it effectively. Let’s think about how to sensibly incorporate scaling into data-driven prediction systems. Let’s think more carefully about how to specify what we want from these systems. And most importantly, let’s see if we can figure out how to tell when machine learning is working and what to do when it isn’t.
Stay tuned for a full syllabus next week…
1. My impression is that many of my colleagues wouldn’t object to this.
2. For the heads, we took a class that had been entirely about graphical models and aligned it to reflect contemporary technological developments in the field.
"Build an agent to go out into the world and learn from scratch in any environment without any prior knowledge" - Why is this sci fi nonsense? We don't have any algorithms for this, yet. Do you believe this isn't possible in principle? Why?
Have really been enjoying your trajectory here as I also struggle with these ideas while bracing for the new semester.
I am curious if you will engage with, or think your students ought to engage with, the social & economic reasons for (or consequences of) the prominence of 'bitter lesson' thinking? Is it really just pure chance or poor pedagogy that has led to this belief that access to capital, not knowledge, is most important?