Discussion about this post

Greg Stoddard

I like the idea. For topic 5, "Data Hygiene", I might also add some examples of when things can go wrong, like leakage between the X features and y, or leakage between train and test. Giving students some exposure to how things can go wrong, and helping them develop intuition to recognize it, feels important. (I'm going to entirely set aside how to think about leakage for LLMs, since that feels like a much harder philosophical and practical problem.)
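A minimal sketch of the train/test leakage mentioned above, using hypothetical toy data (no particular library assumed): if a preprocessing constant is computed on the full dataset before splitting, the training features already encode information about the test set.

```python
# Toy illustration of train/test leakage via preprocessing.
# Centering with a statistic computed on ALL the data lets the
# test point influence the features the model will train on.

def mean(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 100.0]   # the extreme point lands in the test split
train, test = data[:3], data[3:]

# Leaky: the centering constant is computed before the split,
# so it is pulled toward the test outlier.
leaky_center = mean(data)            # 26.5, influenced by the held-out point
leaky_train = [x - leaky_center for x in train]

# Clean: compute the constant on the training split only.
clean_center = mean(train)           # 2.0
clean_train = [x - clean_center for x in train]
```

The same failure mode shows up with scalers, imputers, and feature selection; the general fix is to fit every preprocessing step on the training split only.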

Also, on footnote 3 about the obsession with optimization: random forests don't really optimize train-set error for classification either, right? But they still work pretty well in most settings.

Evan Sparks

Love this! A couple of ideas: you might decide to abandon theory from this course and focus on practice (with a few theory deep dives). You're missing feature engineering from the curriculum, but of course some form of feature engineering (even via an SVM kernel, an autoencoder, or more conventional things like SIFT features) is usually required. You could either figure out a way to say these are all different sides of the same coin, or say they're different and talk about the tradeoffs. You could squeeze SVD or some other low-rank approximation methods into this part, which could allow for a connection to current next-token prediction problems without going into the weeds on attention blocks or whatever.

I think a good chunk of time could be spent on "framing problems in ways that they can be solved by machine learning": if someone sees a new problem, how do they choose an appropriate optimization objective, choose a loss, and then understand whether they have the right kind of data to even build a model? As part of this you introduce evaluation and talk about how the thing you actually care about may not be the thing you can optimize, and how to think about evaluation in this context.
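A small sketch of the "thing you optimize vs. thing you care about" point, with hypothetical toy predictions: log-loss is a differentiable surrogate, accuracy is the metric of interest, and the two can rank models differently.

```python
import math

def log_loss(y_true, probs):
    """Average negative log-likelihood (the surrogate we optimize)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs)) / len(y_true)

def accuracy(y_true, probs):
    """Fraction correct at a 0.5 threshold (the metric we care about)."""
    return sum((p >= 0.5) == bool(y)
               for y, p in zip(y_true, probs)) / len(y_true)

y = [1, 1, 0]
model_a = [0.60, 0.60, 0.40]   # modest confidence, always on the right side
model_b = [0.99, 0.45, 0.01]   # very confident twice, but one misclassification

# model_b achieves LOWER log-loss yet LOWER accuracy than model_a:
# the surrogate and the metric disagree about which model is better.
```

Working through a disagreement like this is a concrete way to motivate separating the training objective from the evaluation protocol.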

