What should we teach in an undergrad machine learning class? My recent posts have been about how undergrad machine learning classes have distilled a lot of bad advice into people’s brains. Given what we know about contemporary practice, I have some half-baked ideas for how we might restructure our curriculum.
Let me say right out front that I have not taught this proposed class yet, and all of these ideas remain very hazy. If you think these are wrong or I have omitted essential topics, please comment below! This blog post is looking for feedback.
I have four motivating principles. First, let’s leave AI out of this. There can be courses on machine learning and courses on AI. And let’s give up the pretense that machine learning is anything other than pattern recognition. Supervised learning (and perhaps “weakly” or “self” supervised learning) is more than enough to fill a semester. This means no unsupervised learning. A course on machine learning would be about computerized pattern recognition and what you need to know to properly deploy it.
Second, let’s avoid laundry lists. I’ve seen syllabi where one week is spent on naive Bayes, the next week on SVM, the next week on random forests, the next week on neural nets. I loved it when I learned it the first time, but I think teaching SVM duality in a machine learning class is confusing and unnecessary. We shouldn’t teach machine learning like a serial recipe book.1
Third, part of machine learning has to be about understanding the current API and how to navigate it. There's no denying that the practicalities of machine learning in 2025 are very different from those of 2005. The software is mature and easy to run. There are open-source models that you can download. Does a first course need to teach you how to code XGBoost from scratch? I think not. However, students should know XGBoost exists and how it might be appropriate for their prediction problems. Similarly, they should know they don't need to train a neural network from scratch to get a good starting model for image classification. Pretrained models exist and are accessible.
Fourth, the crown jewel of machine learning is the holdout method.
Given these four, my vision of a semester would be something like this:
Overview. The fundamental assumption of machine learning is that pattern recognition is possible. We take this as an axiom. Then the technical question is: What heuristics reasonably work to program computers to recognize those assumed patterns?
The Holdout method. The holdout method should be the first thing we teach. Split the data into train and test. Do whatever you want on train. Evaluate test error to select the best model. We can even teach a bit of statistical learning theory here. There's a simple argument based on Hoeffding's inequality that shows why the holdout method is robust and gives a rule of thumb for how large a test set needs to be.2 You can use this argument to talk about hypothetical populations and how to think a bit about sampling.
Stochastic Gradient Descent. If you believe in the holdout method, anything goes on the training set. Now, all anyone does to build models is minimize error on the training set.3 I have a simple motivation for minimizing training error: if you had all the data in the world, minimizing training error would be the right thing to do (i.e., the best predictor is the regression function). However, you only ever get a finite sample, so there will always be many possible fits. The best of these will be selected with your test set. Once we motivate minimizing training error, this leads to a short introduction to optimization. A tutorial on stochastic gradient descent would probably suffice. Do we need to cover automatic differentiation in this class? My gut says no.
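The whole tutorial could fit in a dozen lines. A sketch of SGD minimizing training squared error on synthetic linear data (everything here is illustrative, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y is a noisy linear function of x.
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

# Stochastic gradient descent on the training squared error:
# repeatedly pick one example and step against its gradient.
w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i.w - y_i)^2
        w -= lr * grad
```

That one gradient expression, written by hand, is arguably all the optimization machinery a first course needs; which is why my gut says automatic differentiation can wait.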
Data structures. No, not linked lists. Here I think it’s worth spending time on patterns we believe are in data and heuristics we might use to efficiently learn those patterns. For example, we could talk about what we think “tabular data” means, what cross products of columns are, and introduce decision trees and random forests. We could talk about sequence data, motivating next-token prediction and transformers. We could talk about image data, motivating either transformers or conv nets. The point here would be to think about data types and how specific functional structures are efficient ways to represent pattern recognition predictors.
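As a concrete toy for the tabular case, a cross product of columns might look like this in plain Python (my own hypothetical example): the joint feature (color, size) lets a linear model capture an interaction that neither column captures alone.

```python
# Two categorical columns from a toy tabular dataset.
colors = ["red", "blue", "red", "blue"]
sizes = ["S", "S", "L", "L"]

# Cross product: one joint categorical feature per row.
crossed = [f"{c}_x_{s}" for c, s in zip(colors, sizes)]

# One-hot encode the crossed feature so a linear model can use it.
vocab = sorted(set(crossed))
onehot = [[1 if v == f else 0 for v in vocab] for f in crossed]
```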
Data hygiene. A good chunk of the course would be about data hygiene and APIs. You have to figure out how to make your data machine readable. You need to know which software can quickly take your data and make good predictions. You need to know about estimated run times and scaling. And you'd need to know about fine-tuning. This would mean a lot of playing with sklearn, pytorch, and huggingface.
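One hygiene habit worth drilling early: split first, then fit any preprocessing statistics on the training fold only, so nothing about the test set leaks into the model. A minimal numpy sketch of this (illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 2))

# Split FIRST, then fit preprocessing on the training fold only.
X_train, X_test = X[:80], X[80:]

mu = X_train.mean(axis=0)   # statistics computed from train only
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma  # reuse train stats: no test leakage
```

Standardizing with statistics computed on the full dataset before splitting is a classic, subtle way to contaminate the holdout estimate.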
Data shift. We should teach that our machine learning models likely won’t be perfect predictors forever. The data out in the world differs from the data you collected to train your model. Even if your model works today, it might not work tomorrow. You should anticipate models having shelf lives. It would be helpful to ask, “What do we think is predictable and for how long?” We could motivate the industrial heuristics of retraining and describe how people tend to blend models on different time scales to mitigate these shifts in what we believe about the future.
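A toy simulation can make the shelf-life point vividly: under drift, a frozen model degrades while something that keeps adapting holds up. Here an exponentially weighted running mean stands in crudely for periodic retraining (a hypothetical numpy sketch of my own, not an industrial recipe):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate data shift: the mean of the target drifts over time.
T = 1000
drift = np.linspace(0.0, 3.0, T)
y = drift + 0.1 * rng.normal(size=T)

# "Frozen" model: the mean of the first 100 points, never retrained.
frozen_pred = y[:100].mean()

# "Adapting" model: an exponentially weighted running mean that keeps
# updating (a crude stand-in for retraining on recent data).
alpha = 0.05
ewma = 0.0
ewma_preds = []
for t in range(T):
    ewma_preds.append(ewma)               # predict before seeing y[t]
    ewma = (1 - alpha) * ewma + alpha * y[t]

frozen_err = np.mean((y[100:] - frozen_pred) ** 2)
ewma_err = np.mean((y[100:] - np.array(ewma_preds[100:])) ** 2)
```

The smoothing constant alpha is exactly the kind of time-scale knob the blending heuristics trade off: small alpha trusts the past, large alpha chases the present.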
This could be a fun and useful semester course. It’s still half-baked, but I think it’s ready to pilot somewhere. What do you all think? The only part I’m sure of is I will never again show this plot to my undergrads.
I do think it’s possible to take inspiration from The Cocktail Codex when teaching the zoo of machine learning models. This is another half-baked idea I should blog about.
Csaba Szepesvari reminded me that this is in Chapter 8 of Devroye, Györfi, and Lugosi! It’s classic.
I legitimately believe this optimization obsession in ML is cultural and not necessary. Nearest neighbors, for example, doesn't optimize. But this is a research conjecture and not fleshed out enough for an undergrad class.
I like the idea. For topic 5, "Data Hygiene", I might also add some examples of when things can go wrong, like leakage between X and y features, or leakage between train and test. Giving students some exposure to how things can go wrong and helping them develop intuition to recognize it feels important. (I'm going to entirely set aside how to think about leakage for LLMs, since that feels like a much harder philosophical and practical problem.)
Also, on footnote 3 on the obsession with optimization: random forests don't really optimize train set error for classification either, right? But they still work pretty well in most settings.
Love this! Couple of ideas - you might decide to abandon theory from this course and focus on practice (with a few theory deep dives). You're missing feature engineering from the curriculum, but of course some form of feature engineering (even via SVM kernels or autoencoders or more conventional things like SIFT features) is usually required. You could either figure out a way to say these are all different sides of the same coin, or say they're different and talk about the tradeoffs. You could squeeze SVD or some other low-rank approximation method into this part, which could allow for a connection to current next-token prediction problems without going into the weeds on attention blocks or whatever.
I think a good chunk of time could be spent on "framing problems in ways that they can be solved by machine learning": if someone sees a new problem, how do they choose an appropriate optimization objective, choose a loss, and then understand if they have the right kind of data to even build a model? As part of this, you introduce evals and talk about how the thing you actually care about may not be the thing you can optimize, and how to think about evaluation in this context.