16 Comments
Greg Stoddard:

I like the idea. For topic 5, "Data Hygiene", I might also add some examples of when things can go wrong, like leakage between X and y features, or leakage between train and test. Giving students some exposure to how things can go wrong and helping them develop the intuition to recognize it feels important. (I'm going to entirely set aside how to think about leakage for LLMs, since that feels like a much harder philosophical and practical problem.)
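For instance, here's a toy sketch of the classic split-after-preprocessing bug (made-up data, scikit-learn assumed), the kind of thing I'd want students to be able to spot:

```python
# Toy illustration of train/test leakage: the scaler is fit on all rows,
# so statistics of the test set leak into the preprocessing of the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Leaky version: preprocess first, split second.
X_all_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_all_scaled, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Clean version: split first, then fit the scaler on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clean = LogisticRegression().fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)
print(leaky, clean)
```

On a toy problem the two numbers barely differ, but on messier data this kind of leak can make the test score look better than it should.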

Also, on footnote 3 on the obsession with optimization: random forests don't really optimize train set error for classification either, right? But they still work pretty well in most settings.

Ben Recht:

Absolutely. There should be a whole lecture on leakage and bad data partitioning.

Curious: Why do you say that random forests don't optimize?

Greg Stoddard:

Hmm, I suppose I misspoke. Random forests don't optimize log-loss, but they still seem to work quite well for it (perhaps they need a calibration step after fitting, but still). And I think you're implicitly right that they are fundamentally different from kNN, which barely looks at the y-data at all when fitting a model.

But extremely randomized trees (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) exist, and they certainly don't optimize!
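To make the first point concrete, here's a rough sketch (synthetic data, scikit-learn assumed) that checks the held-out log-loss of both forests before and after a calibration step:

```python
# Toy sketch: neither forest is fit by minimizing log-loss, yet both produce
# usable probabilities, and post-hoc calibration can tighten them further.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("extra-trees", ExtraTreesClassifier(random_state=0))]:
    raw = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    cal = CalibratedClassifierCV(clf, method="isotonic", cv=5).fit(
        X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(name, "raw log-loss:", log_loss(y_te, raw),
          "calibrated:", log_loss(y_te, cal))
```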

Evan Sparks:

Love this! A couple of ideas: you might decide to abandon theory in this course and focus on practice (with a few theory deep dives). You're also missing feature engineering from the curriculum, but of course some form of feature engineering (whether via an SVM kernel, an autoencoder, or more conventional things like SIFT features) is usually required. You could either figure out a way to say these are all different sides of the same coin, or say they're different and talk about the tradeoffs. You could squeeze SVD or some other low-rank approximation method into this part, which would allow a connection to current next-token prediction problems without going into the weeds on attention blocks or whatever.
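For example, here's a rough sketch of the low-rank-features version (toy data, scikit-learn assumed), where the feature map just lives inside the pipeline:

```python
# Sketch: a low-rank feature map (truncated SVD) feeding a linear classifier,
# one concrete instance of "feature engineering" as part of the model pipeline.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Wide synthetic data standing in for, say, a bag-of-words or pixel matrix.
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)
model = make_pipeline(TruncatedSVD(n_components=20, random_state=0),
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=3).mean())
```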

I think a good chunk of time could be spent on "framing problems in ways that they can be solved by machine learning": if someone sees a new problem, how do they choose an appropriate optimization objective, choose a loss, and then understand whether they even have the right kind of data to build a model? As part of this you introduce evaluation and talk about how the thing you actually care about may not be the thing you can optimize, and how to think about evaluation in that context.
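Here's a toy sketch of the "thing you optimize isn't the thing you care about" point (made-up data and a made-up precision requirement, scikit-learn assumed): train against a convenient surrogate loss, but report the metric the application actually demands.

```python
# Sketch: the model is trained against log-loss (a convenient surrogate),
# but evaluated on the quantity you actually care about, here recall at a
# hypothetical precision floor dictated by the application.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, scores)

min_precision = 0.8  # hypothetical application requirement
print("recall at precision >= 0.8:", recall[precision >= min_precision].max())
```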

Ben Recht:

Love it. For me, "feature engineering" is an important part of what I labeled as "data structures." A lot of people like to say that neural nets "learn" representations and features. So I'm lumping it all---architecture tuning, feature engineering, thinking about signal processing---under this bad term of "data structures."

I wonder what the best way would be to get into problem framing, optimization design, and evaluation. It might just be through examples. A few case studies could be very illuminating here.

Alex Tolley:

Unless it has already been taught, you should include an introductory class on simple statistical regression techniques and where they work well. Why use a jackhammer when a small hammer is perfectly adequate?

Is Stochastic Gradient Descent necessary?

Data Structures should have lots of examples of where certain techniques work best and where they fail.

And lastly, isn't the "no free lunch" aphorism, the idea that there is no universally best technique, still applicable?

Ben Recht:

Yes, linear models are a must. And I'd focus on hyperplane classifiers and how to find them.

Also, I'm the first to argue that if you can reduce your problem to a linear system solve, you have won. But I think that we end up spending too much time on least-squares and normal equations in undergrad ML, and it distracts from a bigger picture.

So while I agree that SGD is not all you need and is often inefficient, I was just saying that you could get away with only teaching SGD and suggesting that there are other optimization algorithms out there for the students who want to dig deeper.
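Concretely, "only teach SGD" could look something like this sketch: a hyperplane classifier fit with a few lines of hand-rolled stochastic gradient descent on the logistic loss (synthetic data, purely for illustration):

```python
# Sketch: a hyperplane classifier fit with plain stochastic gradient descent
# on the logistic loss, about a dozen lines of NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))  # labels in {-1, +1}

w, lr = np.zeros(d), 0.1
for epoch in range(20):
    for i in rng.permutation(n):
        margin = y[i] * (X[i] @ w)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # gradient of log(1 + exp(-margin))
        w -= lr * grad

print("train accuracy:", np.mean(np.sign(X @ w) == y))
```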

Steph:

omg, absolutely, on point 5: I don't come from a computer science background, so even when I could get on board with the maths in my pattern recognition ML class, I was really lacking the whole 'infrastructure' of how to actually run it.

Dan T.:

“Split the data into train and test. Do whatever you want on train. Evaluate test error to select the best model”. When I look at online discussions of the “holdout method”, this isn’t what they seem to say, although sometimes the “validation set” is used the way you are using the test set. I take it that this is “holdout in practice”, or something.

Also, I don’t get it: select from what set of models? Ones that are “roughly the same”? Or models from different categories (NN, decision tree, …) that have about the same performance on the training data? It’s not entirely clear what distinguishes the holdout method from just training on the whole dataset. (Nor did I understand this from previous posts. Speak as you might to a young child, or a golden retriever.)
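To make my confusion concrete, here's my best guess at the recipe as a toy sketch (made-up data, scikit-learn assumed); is this roughly the intended workflow?

```python
# One reading of the recipe: fit several candidate models on the training
# split, then use the held-out split's error to pick among them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
}
holdout_scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
                  for name, m in candidates.items()}
best = max(holdout_scores, key=holdout_scores.get)
print(holdout_scores, "-> choose", best)
```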

marthasimons:

Great intro to machine learning basics! As an AI writer https://eduwriter.ai/ , I appreciate how this breaks down complex concepts into digestible pieces—perfect for anyone looking to understand the fundamentals without getting overwhelmed.

Yonatan:

Interesting list!

As a practicing data scientist, I would put even more emphasis on the holdout section: how to choose a holdout set correctly for different data types (time series, tabular, etc.), and how to verify your holdout set.

I was "fooled" many times to think a model is doing great just because I created an inappropriate hold out set...

Alena Mariya:

Thank you!

Neal Parikh:

This is fairly similar to how I teach AI/ML in my class at an international and public affairs school, to master's students in policy (the class is called "AI: A Survey for Policymakers" and the link is here if you want: https://nparikh.org/sipa6545). The major difference is obviously that I leave out some of the more directly technical components (there is no math or coding that they do directly, though they read technical papers and gloss over certain technical details), and there is significantly more emphasis on case studies in application domains where policy questions arise (criminal justice, automated employment decision tools, healthcare screening, etc.) and on other policy-oriented topics (fairness, privacy, etc.).

But it is similar in the following respects. I emphasize major categories of tasks (regression, classification, structured prediction) and basically give one canonical method for each, just so they recognize the name when they read a paper; there are no laundry lists of methods. I spend time talking about things like confusion matrices and the various classification metrics one might use, and things like disaggregated evaluation. I strongly emphasize the holdout method, and the fundamental difference between, e.g., the use of linear regression in the social sciences (which some of them have seen) and in ML, and connect it to current debates about leakage in LLM evaluation; I don't go into the weeds on it but just point out that this is a major current topic.

I show them gradient descent visually (I don't bother with stochastic) and show them how, as you fiddle with, say, the slope and intercept of a linear regression model, the error changes, so they can visualize it. This is more to demystify the whole process and explain terms they may have seen in passing than for the actual relevance of GD to these students. I emphasize data hygiene, but go a bit further than what you mention in the ethics direction (e.g., if you are building some kind of education model, the quality of data from poorer schools may be more inconsistent than from richer schools, and this may affect model behavior in some "fairness" way, with fairness taken a bit loosely here); I also emphasize data shift quite a bit for the same reason.
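Roughly, that demo amounts to something like the following sketch (toy data; the actual class version is a visualization, this just prints the squared error as the slope and intercept update):

```python
# Sketch: gradient descent on the slope and intercept of a one-variable
# linear regression, watching the mean squared error fall.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=200)  # "true" slope 2, intercept 0.5

slope, intercept, lr = 0.0, 0.0, 0.1
for step in range(200):
    residual = slope * x + intercept - y
    slope -= lr * np.mean(residual * x)   # gradient of (1/2) * MSE w.r.t. slope
    intercept -= lr * np.mean(residual)   # gradient of (1/2) * MSE w.r.t. intercept
    if step % 50 == 0:
        print(step, np.mean(residual ** 2))
print("fit:", slope, intercept)
```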

Anyway, I basically agree, and I think for a 101 course it may be worth taking inspiration from what might superficially appear to be "less technical" courses taught outside CS/stats, but which can actually be more conceptually sophisticated: they make room for topics that a standard math-heavy ML course never gets to, because so much of that course is spent deriving and implementing all the different methods.

Neal Parikh:

For things like train/test splits, there are nice examples that illustrate the technical point, raise concrete ethics questions, and are real-world case studies rather than toys. For example, there are case studies involving child welfare agencies where the split was done incorrectly, scattering complaints about a single household, and possibly different children, across train and test, when you want to keep these together for obvious reasons (the examples/rows are complaints, not households or kids).
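A rough sketch of the fix (hypothetical column names, scikit-learn assumed): split by household with a group-aware splitter, so that every complaint about the same household lands on the same side.

```python
# Sketch: group-aware splitting keeps all rows sharing a household id
# on the same side of the train/test split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_complaints = 1000
household_id = rng.integers(0, 200, size=n_complaints)  # several complaints per household
X = rng.normal(size=(n_complaints, 10))
y = rng.integers(0, 2, size=n_complaints)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=household_id))

# No household appears on both sides of the split.
assert set(household_id[train_idx]).isdisjoint(household_id[test_idx])
```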

Mark Johnson:

What about this textbook? https://www.statlearning.com/

Before I left academia for the start-up and corporate world, that's what I decided to teach in my intro class, for the reasons you mention.

RR:

Sign me up for this class!

Two things worth adding IMO:

- Checking whether the test set performance is "suspiciously" good (e.g., "Clever Hans" predictors; in an earlier post, I think you had a radiology example where the NN was picking up some machine artifact). Maybe this is a "Data Drift" thing or a "Data Hygiene" thing? (A rough sketch of one such check follows below.)

- For situations where the end-user needs some "comfort" about what the heck the model is doing, maybe some coverage of interpretability?
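A rough sketch of that check (toy data with a planted artifact column, scikit-learn assumed): permutation importance on the test set, where a single dominant feature is the Clever Hans red flag, and which also doubles as a crude interpretability tool for the second point.

```python
# Sketch: a cheap sanity check for "suspiciously good" test performance.
# Permutation importance on the test set flags a planted artifact column
# that essentially encodes the label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
artifact = y + 0.01 * np.random.default_rng(0).normal(size=len(y))  # leaky "scanner" column
X = np.column_stack([X, artifact])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))  # suspiciously close to 1.0

imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("most important feature:", imp.importances_mean.argmax())  # the planted column (index 10)
```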
