Love this! A couple of ideas: you might decide to abandon theory from this course and focus on practice (with a few theory deep dives). You’re missing feature engineering from the curriculum, but of course some form of feature engineering (even via an SVM kernel, an autoencoder, or more conventional things like SIFT features) is usually required. You could either figure out a way to say these are all different sides of the same coin, or say they’re different and talk about the tradeoffs. You could squeeze SVD or some other low-rank approximation method into this part, which could allow for a connection to current next-token prediction problems without going into the weeds on attention blocks or whatever.
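On the SVD point, a minimal sketch of what a low-rank approximation buys you (plain NumPy; the matrix and the rank k are just placeholders):

```python
import numpy as np

# Placeholder data matrix: 100 samples x 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Full SVD, then keep only the top-k singular triples.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error shrinks as k grows; the columns of
# U[:, :k] * s[:k] can also be used directly as k learned features.
print(np.linalg.norm(X - X_k) / np.linalg.norm(X))
```

The same truncated-SVD idea sits behind classic LSA and matrix-factorization recommenders, which may be the bridge toward next-token prediction I have in mind here, without touching attention.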
I think a good chunk of time could be spent on “framing problems in ways that they can be solved by machine learning”: if someone sees a new problem, how do they choose an appropriate optimization objective, choose a loss, and then understand whether they even have the right kind of data to build a model? As part of this you introduce evaluation and talk about how the thing you actually care about may not be the thing you can optimize, and how to think about evaluation in this context.
Love it. For me, "feature engineering" is an important part of what I labeled as "data structures." A lot of people like to say that neural nets "learn" representations and features. So I'm lumping it all---architecture tuning, feature engineering, thinking about signal processing---under this bad term of "data structures."
I wonder what the best way would be to get into problem framing, optimization design, and evaluation. It might just be through examples. A few case studies could be very illuminating here.
I like the idea. For topic 5, "Data Hygiene", I might also add some examples of when things can go wrong, like leakage between X and y features, or leakage between train and test. Giving students some exposure to how things can go wrong, and helping them develop intuition to recognize it, feels important. (I'm going to entirely set aside how to think about leakage for LLMs, since that feels like a much harder philosophical and practical problem.)
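Concretely, the train/test flavor of leakage is often as simple as fitting preprocessing on the full dataset before splitting. A minimal sketch (scikit-learn names are real; the data is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical tabular data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler is fit on ALL rows, so test-set statistics
# bleed into the training features before the split happens.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# Clean: split first, fit the preprocessing on train only,
# then apply the frozen transform to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```

With a plain scaler the damage is small, but the same pattern with target encoding, feature selection, or imputation can inflate test scores a lot.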
Also, on footnote 3 on the obsession with optimization: random forests don't really optimize train set error for classification either, right? But they still work pretty well in most settings.
Absolutely. There should be a whole lecture on leakage and bad data partitioning.
Curious: Why do you say that random forests don't optimize?
Hmm, I suppose I misspoke. Random forests don't optimize log-loss, but they still seem to work quite well for optimizing log-loss (perhaps they require a calibration step after fitting, but still). But I think you're implicitly right that they are fundamentally like kNN, in that they barely look at the y-data at all when fitting a model.
But Extremely Randomized Trees (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) exist and they certainly don't optimize!
Unless it has already been taught, you should include an introductory class on simple statistical regression techniques and where they work well. Why use a jackhammer when a small hammer is perfectly adequate?
Is Stochastic Gradient Descent necessary?
Data Structures should have lots of examples of where certain techniques work best and where they fail.
And lastly, isn't the "no free lunch" aphorism, about there being no universally best technique, still applicable?
Yes, linear models are a must. And I'd focus on hyperplane classifiers and how to find them.
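A minimal sketch of what "finding the hyperplane" can look like in class (scikit-learn; the blobs are made-up data): the fitted coefficients are literally the normal vector and offset of the separating hyperplane.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two made-up Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)

# The decision boundary is the hyperplane w . x + b = 0;
# coef_ is the normal vector w and intercept_ is the offset b.
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane:", w, b, "| train accuracy:", clf.score(X, y))
```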
Also, I'm the first to argue that if you can reduce your problem to a linear system solve, you have won. But I think that we end up spending too much time on least-squares and normal equations in undergrad ML, and it distracts from a bigger picture.
So while I agree that SGD is not all you need and is often inefficient, I was just saying that you could get away with only teaching SGD and suggesting that there are other optimization algorithms out there for the students who want to dig deeper.
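For reference, the whole of SGD for a least-squares linear model fits in a few lines (plain NumPy; the step size and epoch count are arbitrary choices):

```python
import numpy as np

# Made-up linear data: y = X w_true + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr = 0.01
for epoch in range(50):
    for i in rng.permutation(len(X)):
        # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
        grad = (X[i] @ w - y[i]) * X[i]
        w -= lr * grad

print(w)  # should land close to w_true
```

Anything fancier (momentum, Adam, second-order methods) can then be framed as "the same loop with a smarter step" for students who want to dig deeper.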
omg absolutely on point 5: I don't come from a computer science background so even when I could get on board with the maths in my pattern recognition ML class, I was really lacking in the whole 'infrastructure' of how to actually run it
“Split the data into train and test. Do whatever you want on train. Evaluate test error to select the best model”. When I look at online discussions of “holdout method”, this isn’t what they seem to say, although sometimes the “validation set” is used the way you are using the test set. I take it that this is “holdout in practice”, or something. Also, I don’t get it: select from what set of models? Ones that are “roughly the same”? Or models from different categories (NN, decision tree,…) that have about the same performance on the training data? It’s not entirely clear what distinguishes the holdout method from just training on the whole dataset. (Nor did I understand this from previous posts. Speak as you might to a young child, or a golden retriever.)
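To make my confusion concrete: is the intended workflow something like the sketch below, where the "set of models" is whatever candidates you care to try (the particular models here are arbitrary), and the only rule is that the test rows never influence fitting?

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The "set of models" can be anything: different algorithms,
# hyperparameters, feature sets, etc.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "knn": KNeighborsClassifier(n_neighbors=10),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)              # only train rows touch fitting
    scores[name] = model.score(X_te, y_te)

print(scores, "-> pick", max(scores, key=scores.get))
```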
I discovered AI in early 1982, when I chanced on Doug Hofstadter’s Pulitzer-winning first book — still his best — “Gödel, Escher, Bach: An Eternal Golden Braid” at a book fair in New Delhi, India. It changed my life. I’ve spent the better part of the past four decades in research in ML and AI, watching it emerge from the shadows, where it was studied by a small number of enthusiasts like me, to today, where it seems to rule the tech industry as a trillion-dollar technology.
Then, as now, the biggest problem with AI is how to make accurate predictions. Herbert Simon, one of the founders of the field, made some wild predictions back in the 1960s about where the field would be in 10 years. He was off by about 50 years. But, as predictions go, that’s not all that bad.
Most sci fi flicks of that era predicted flying cars by now — recall the beginning scene of Blade Runner — and other than in Chitty Chitty Bang Bang and some James Bond movies, we don’t seem remotely close to getting flying cars. Compared to such predictions, AI has done rather well. Thanks to a whole host of related inventions, from the smartphone to the internet and cloud computing, the reach of AI is more pervasive than ever. Where will AI be in the next 50 years?
This gets to my answer to the question. Forty years ago, I found machine learning the most fascinating field I could possibly study. I don’t think that any longer. The strengths and weaknesses of machine learning have become apparent in the ensuing decades. It’s best to explain this by an analogy, and I love analogies (as does Doug Hofstadter, whose most recent book, “Surfaces and Essences”, is subtitled “Analogy as the Fuel and Fire of Thinking”).
Imagine you are fascinated, as the ancients were, by the possibility of human-powered flight. Every culture known to me has humans soaring in the air like birds in its mythology. In Greek mythology, Daedalus built wings of feathers and wax to help him and his son, Icarus, escape from imprisonment. Sadly, Icarus flew too close to the sun, not heeding his father’s warning, and fell to his death. Indian mythology is riddled with stories of flying machines.
We now have flying machines that whisk us across continents at nearly the speed of sound. But we need huge airports, mile-long runways, jet fuel, seat belts, and all the paraphernalia of modern air travel (don’t get me started on TSA checks). Where’s our dream of human-powered flight, soaring like birds? Gone into mythology, where it shall remain.
ML is in a similar state. Many of us 40 years ago dreamed our machines would learn like us, like children, curious about the world, learn fluency in many languages, help us in our old age, and become our intellectual companions. Alas, that’s largely a pipe dream.
Modern ML, like modern air travel, is a completely different enterprise. It needs huge labeled datasets (now in the petabytes). It’s notoriously brittle, as recent single pixel attacks have shown how vulnerable deep learning, our best ML technology, is. If you cater to its every whim, it can be successful, but it is no match for human learning, as a modern 747 is no match for the common garden sparrows that flit about my backyard.
So, I am curious whether AI will ever reach a state where it leads to truly intelligent machines that can soar in the sky as birds do, without all the trappings of modern airliners, or whether it will forever be consigned to the same fate: mile-long runways, jet fuel, seat belts, and TSA background checks.
So, like the great MLK, I too have a dream. I dream of the day when machine learners will be like human learners: like children, eternally curious about the world, not dependent on terabyte-sized labeled datasets and careful human parameter and architecture tweaking. Is this a pipe dream? Will we get to the promised land? Or, like modern air travel, is it our fate to be relegated to intrusive TSA checks whenever we feel like soaring like the birds?
The best way, in my view, to understand a field is to understand the reason why the field exists in the first place. Why do we need a field like machine learning? In short, what problems does it solve, and why?
Let’s start with an analogy, something you do practically every morning: you wake up and get ready to go to work. What problems do you need to solve? For one, you need to put on some clothes to protect your body from the weather and your feet against the rough surfaces you might encounter. You need to perhaps cover your head with a hat or a scarf and protect your eyes with sunglasses against the harsh rays of the sun. These are the problems we need to solve in getting dressed.
Algorithms are like clothes and shoes, hats and scarves and sunglasses, continuing the analogy from above. You could wear sneakers, dress shoes, or high heels. You could wear a T-shirt, a dress shirt, a full length skirt and so on. Clothes and shoes are ways to solve the problem of dressing up for work. Which clothes you wear and what shoes you put on may vary, depending on the occasion and the weather. Similarly, which machine learning algorithm you use may depend on the problem, the data, the distribution of instances etc. The lesson from the fashion industry is quite apt and worth remembering. Problems never change (you always need something to cover your feet), but algorithms change often (new styles of clothes and shoes get created every week or month). Don’t waste time learning fashionable solutions when they will become like yesterday’s newspaper. Problems last, algorithms don’t!
There’s a tendency, unfortunately, of recommending universal solutions to machine learning these days (e.g., learn TensorFlow and code up every algorithm as stochastic gradient descent using a deep neural net). To me, this makes just about as much sense as wrapping yourself up in your bedsheets to go to work. Sure, it covers most parts of your body, and probably could do the job, but it’s a one size fits all approach that neither shows any style or taste, nor any understanding of the machine learning (or dressing) problem.
The machine learning community has spent over four decades trying to understand how to pose the problem of machine learning. Start by understanding a few of these formulations, and resist the temptation to view every machine learning problem through a single simplified lens (like supervised learning, one of dozens of ways of posing ML problems). The major categories include unsupervised learning, the most important; then reinforcement learning (learning by trial and error, the most prevalent in children after unsupervised learning); and finally supervised learning (which occurs rather late, because it requires labels and language, which young children mostly lack in their early years). Transfer learning is growing in importance, as labeled data is expensive and hard to collect for every new problem. There’s lifelong learning, and online learning, and so on. One of the deepest and most interesting areas of machine learning is the theory of probably approximately correct (PAC) learning, a fascinating area that looks at how we can guarantee that a machine learning algorithm will work reliably, or will produce a sufficiently accurate answer. Whether you understand PAC learning or not tells me whether you are an ML scientist or an ML engineer.
The most basic formulation of machine learning, and the one that gets short shrift in many popular expositions, is learning a “representation”. What does this even mean? Take the number “three”. I could write it using three strokes III, or as 11, or as 3. These correspond to the unary, binary, and decimal representations. The last was invented in India more than 2000 years ago. Remarkably, the Greeks, for all their wisdom, never discovered the use of 0 (zero) and never invented decimal numbers. Claude Shannon, the famed inventor of information theory, popularized binary representations for computers in his famous 1937 MS thesis at MIT.
What does it mean for a computer to “learn” a representation? Take a selfie and imagine writing a program to identify your image (or your spouse or your pet) from the image. The phone uses one representation for the image (usually something like JPEG, which is built on the discrete cosine transform, a Fourier-type basis). It turns out this basis is a terrible representation for machine learning. There are many better representations, and new ones get invented all the time. A representation is like the material that makes up your dress. There’s cotton and polyester and wool and nylon. Each of these has its strengths and weaknesses. Similarly, different representations of input data have their pros and cons. Resist the temptation to view one representation as superior to all the others.
Humans spend most of their day solving sequential tasks (driving, eating, typing, walking, etc.). All of these require making a sequence of decisions, and learning such tasks involves reinforcement learning. Without RL, we would not get very far. Sadly, most introductory ML textbooks ignore this basic and important area, to their discredit. Fortunately, there are excellent specialized books that cover it.
Let me end with two famous maxims about learning a topic from the legendary physicist Richard Feynman. First: “What I cannot create, I do not understand.” What he meant was that unless you can recreate an idea or an algorithm yourself, you probably haven’t understood it well enough. Second: “Know how to solve every problem that has already been solved.” This maxim is about making sure you understand what has been done before. For most of us, these are hard principles to follow, but to the extent you can follow them, you will find your way to mastery of any field, including machine learning. Good luck!
Interesting list!
As a practicing data scientist, I would emphasize the holdout section even more: how to choose a holdout set correctly for different data types (time series, tabular, etc.), and how to verify your holdout set.
I was "fooled" many times into thinking a model was doing great just because I had created an inappropriate holdout set...
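For the time-series case in particular, the point is that a random split leaks the future into training; the holdout has to respect time. A minimal sketch (scikit-learn's TimeSeriesSplit is real; the data is made up):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Made-up series: 1000 observations ordered by time.
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = np.sin(X[:, 0] / 50.0)

# Chronological holdout: everything before the cutoff trains the model,
# everything after it is the test set. A shuffled split would let the
# model "see the future".
cutoff = 800
X_tr, y_tr, X_te, y_te = X[:cutoff], y[:cutoff], X[cutoff:], y[cutoff:]

# Rolling-origin evaluation: several chronological folds.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("train ends at", train_idx[-1], "| test", test_idx[0], "-", test_idx[-1])
```

Verifying the holdout then mostly means checking that nothing crosses the cutoff: no features computed from future rows, and no duplicate entities sitting on both sides of the split.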
Thank you!
This is fairly similar to how I teach AI/ML in my class in an international and public affairs school, to masters in policy students (the class is called "AI: A Survey for Policymakers" and the link is here if you want: https://nparikh.org/sipa6545). The major difference is obviously that I leave out some of the more directly technical components (there is no math or coding they directly do, though they read technical papers and gloss over certain technical details) and there is significantly increased emphasis on case studies in application domains where policy questions arise (criminal justice, automated employment decision tools, healthcare screening, etc) and other policy-oriented topics (fairness, privacy, etc).
But it is similar in the following respects: I emphasize major categories of tasks (regression, classification, structured prediction) and basically give one canonical method for each, just so they recognize the name when they read a paper; there are no laundry lists of methods. I spend time talking about things like confusion matrices and the various classification metrics one might use, and things like disaggregated evaluation. I strongly emphasize the holdout method, and the fundamental difference between, e.g., the use of linear regression in the social sciences (which some of them have seen) and in ML, and connect it to current debates about leakage in LLM evaluation — I don't go into the weeds on it, but just point out that this is a major current topic. I show them gradient descent visually (I don't bother with stochastic) and show them how, as you fiddle with, say, the slope/intercept of a linear regression model, the error rate changes, so they can visualize it. This is more to demystify the whole process and explain terms they may have seen in passing than for the actual relevance of GD to these students. I emphasize data hygiene, but go a bit further than what you mention in the ethics direction (e.g., if you are building some kind of education model, the quality of data from poorer schools may be more inconsistent than from richer schools, and this may affect model behavior in some "fairness" way, with fairness taken a bit loosely here); I also emphasize data shift quite a bit, for the same reason.
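Concretely, the slope/intercept demo needs nothing more than this (plain NumPy, made-up data, arbitrary step size), which is enough to watch the error fall as the two numbers get nudged:

```python
import numpy as np

# Made-up 1-D regression data around y = 3x + 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

slope, intercept = 0.0, 0.0
lr = 0.01
for step in range(2000):
    err = slope * x + intercept - y
    # Gradients of the mean squared error w.r.t. slope and intercept.
    slope -= lr * 2 * np.mean(err * x)
    intercept -= lr * 2 * np.mean(err)
    if step % 500 == 0:
        print(step, np.mean(err ** 2))

print(slope, intercept)  # should approach 3 and 2
```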
Anyway, I basically agree, and I think for a 101 course it may be worth taking inspiration from what might superficially appear to be "less technical" courses taught outside CS/stats, but that can actually be more conceptually sophisticated, because they emphasize many topics that a standard math-heavy ML course never reaches, since so much time there is spent deriving and implementing all the different methods.
For things like train/test splits, there are nice examples that illustrate the technical point and concrete ethics questions at the same time, and they are not toy things but real-world case studies. For example, there are case studies involving child welfare agencies where the split was done incorrectly, scattering complaints about a single household (and possibly different children in it) across train and test, when you want to keep these together for obvious reasons (the examples/rows are complaints, not households or kids).
What about this textbook? https://www.statlearning.com/
Before I left academia for start-up & corporate world, that's what I decided to teach in my intro class, for the reasons you mention.
Sign me up for this class!
Two things worth adding IMO:
- Checking if the test set performance is "suspiciously" good (e.g., "Clever Hans" predictors; in an earlier post, I think you had a radiology example where the NN was picking up some machine artifact). Maybe this is a "Data Drift" thing or a "Data Hygiene" thing?
- For situations where the end-user needs some "comfort" about what the heck the model is doing, maybe some coverage of interpretability?