This is a live blog of Lecture 2 of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
To frame the prototypical machine learning problem, I like to use a metaphor I borrowed from Chris Wiggins. I start with a spreadsheet. Each row of the spreadsheet corresponds to some unit or example. But I don’t really care what the units mean. I just know that I have a bunch of columns filled in with data. And I’m told one of the columns is special. We’re about to get a load of new rows in our spreadsheet, but someone downstairs forgot to fill in the special column. Management has tasked me with writing an Excel formula to fill in what should be there. For whatever reason, I don’t get to see these new rows and have to build the formula from the spreadsheet I have.
How might I go about it? Let’s work through some cases. I could imagine matching rows. If I take a new row, compare it to all of the others I have so far, and find an exact match, I can use the value of the special column in the matched row as my value in the new row. That’s pretty straightforward. I’ll look for duplicate rows where I have data and fill in the missing values by using the values in those rows.
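The exact-match strategy can be sketched in a few lines of Python. The column names and rows here are made up for illustration; "special" stands in for whatever column management cares about.

```python
# Toy spreadsheet: each dict is a row; "special" is the column to predict.
rows = [
    {"a": 1, "b": 2, "special": "yes"},
    {"a": 3, "b": 4, "special": "no"},
]

def fill_by_exact_match(new_row, rows, special="special"):
    """Return the special value from the first row whose other columns
    exactly match new_row, or None if nothing matches."""
    for row in rows:
        if all(row.get(k) == v for k, v in new_row.items()):
            return row[special]
    return None
```

A new row that duplicates an existing one gets its special value copied over; a row with no duplicate comes back empty, which is exactly the gap the rest of the discussion has to fill.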
But what if I find multiple duplicates with different numbers in the special column? What do I fill in now? I could use the most frequently occurring value. I could use the mean value. I could return a function that spits out a random value every time I open the spreadsheet. All of these are valid options, and I’ll have to ask management what would be best. For this, they’ll hopefully give me a precise specification that lets me define a quantitative measure of accuracy, and I’ll pick the value that maximizes accuracy.
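The three tie-breaking options above are one-liners in Python's standard library. This is a sketch, not a recommendation; which strategy is right is precisely the question to take back to management.

```python
import random
from statistics import mean, mode

def fill_from_duplicates(values, strategy="mode"):
    """Resolve conflicting special-column values from duplicate rows.

    values: the special-column entries of every matching row.
    strategy: "mode" (most frequent), "mean" (numeric average),
              or "random" (a fresh sample on every call).
    """
    if strategy == "mode":
        return mode(values)
    if strategy == "mean":
        return mean(values)
    if strategy == "random":
        return random.choice(values)
    raise ValueError(f"unknown strategy: {strategy}")
```

Once a precise accuracy measure is specified, each strategy can be scored against it and the best one kept.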
And what if no row matches exactly? Shoot, now I have to make another decision. I could find the closest row, but closest in what sense? And maybe I should look at far-away rows and use the opposite value from what they have? Ugh, there are too many options, and choosing the function now seems impossible.
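One arbitrary way to resolve "closest in what sense" is squared Euclidean distance over the numeric columns, which turns the idea into a nearest-neighbor lookup. This sketch assumes all non-special columns are numeric; the choice of distance is exactly the modeling decision being flagged.

```python
def fill_by_nearest(new_row, rows, special="special"):
    """Nearest-neighbor fill: copy the special value from the row
    whose other columns are closest to new_row in squared
    Euclidean distance (one arbitrary choice among many)."""
    def dist(row):
        return sum((row[k] - v) ** 2 for k, v in new_row.items())
    return min(rows, key=dist)[special]
```

Swapping in a different distance, or averaging over the k closest rows instead of taking one, gives a whole family of alternative formulas, which is the point: the space of options is enormous.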
So maybe I’ll do an experiment. I’ll take the last row of my spreadsheet and pretend I don’t have the special column. I’ll write as many formulas as I can. The formula might need a very complicated Excel expression, but you can do arbitrarily absurd things in Excel. Out of all of these formulas, I’ll pick one that guesses the special column. And then I’ll use that one for the new data too.
But why single out that last row? I can do something similar for every row! I’ll invent a set of plausible functions. I’ll evaluate how well they predict on the spreadsheet I have. I’ll choose the function that maximizes the accuracy.
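The procedure in the last two paragraphs, inventing a set of candidate formulas and keeping the one most accurate on the rows we already have, can be sketched directly. The candidate formulas here are toy lambdas; any callable from row features to a predicted special value would do.

```python
def pick_best_formula(formulas, rows, special="special"):
    """Score every candidate formula by how often it reproduces the
    special column on the rows we already have, and return the most
    accurate one (empirical risk minimization in miniature)."""
    def accuracy(f):
        hits = 0
        for row in rows:
            features = {k: v for k, v in row.items() if k != special}
            hits += f(features) == row[special]
        return hits / len(rows)
    return max(formulas, key=accuracy)
```

Whether the winner of this contest also predicts well on the rows we have not seen is the question the next few paragraphs take up.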
This is more or less the art of machine learning.
I mean, really, how different is this from what we do in a Kaggle competition or on a dataset in the venerable UCI repository? We’re just given csv files and tasked with maximizing accuracy on special columns. We never think about what the entries mean.
Even my random number generator joke above is valid: Shannon’s language prediction game would correspond to the columns being characters and the goal being to return a sample from a language model. And since they are just Shannon models at scale, the metaphor obviously then extends to our AI Overlords known as Large Language Models.
We blindly accept that whoever made the spreadsheet believes the special column is predictable from the rest. The admission fee for machine learning is believing that pattern recognition is possible. This is the inescapable but almost always implicit model.
The question is, if we assume that there is a pattern that is predictable by some computational mechanism, can we find it? When does searching for a function that maximizes accuracy on a data set we have collected find us a function that makes good predictions on data we haven’t seen yet?
In the next four lectures, I’ll describe four different plausibility arguments for when and why this might work.