This is a live blog of Lecture 3 (part 1 of 2) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
Let’s start with our magic Excel spreadsheet from last time and suppose the special column we’re trying to predict is binary. It can only take two values, whether they be 0 or 1, cat or dog, healthy or sick. A simple Excel formula I could try would be to take a weighted combination of the elements in my row, and if that combination is positive, I’ll label that row with a 1. Otherwise, I’ll label it with a -1. Here’s the Excel formula where row 2 has my relevant data, and I’ve stashed the weights in row 1000:
=SIGN(SUMPRODUCT(B2:W2, B1000:W1000))
This formula is called a linear classifier.
But how can we find the weights? Rosenblatt provided a shockingly simple procedure. Start with the weights set to zero. Then iterate through your data set in whatever order you’d like. Take one of the examples in your data set and see if you get the right answer with the current weights. If you do, don’t do anything. If your current classification rule makes a mistake, update the weights. I’ll stop being a sicko and give you the formula as an equation. Let’s say the ith data point is a list of attributes, data[i], and has class label equal to 1 or -1 and denoted label[i]. Then, the perceptron update for the mislabeled example is
weights_new = weights_old + label[i]*data[i]
That’s it. Implementing this in Excel would be a pain because I think you need to invoke the Visual Basic editor. But Chris Re and I figured out how to run this in SQL. Rosenblatt implemented this on an IBM 704 that could do about 12000 arithmetic operations per second. His demo learned to distinguish between punch cards that were marked either on the left side or the right side. And yet, going from this update rule to what we do to train big neural networks today only requires one extra step. How far we’ve come…
Not only is the Perceptron update rule simple, but the analysis is simple too. If you’re interested in the full details, we have a self-contained presentation in Chapter 3 of PPA. I will only give the punch line here. Let’s say you have been promised there is a robust linear classification rule for your data. By robust, I mean that you would need to deform a data point by an amount M to change the prediction of the classifier. When M is large, more effort is required to flip the linear classifier from the correct to incorrect sign. We say that such a classifier has margin M1.
In 1962, Al Novikoff proved that when a large margin classifier exists, the Perceptron will find it. If you run the perceptron on the data set, the number of updates it makes is at most 1/M. In other words, it makes at most 1/M mistakes when fitting the perceptron. The number of updates required will be inversely proportional to how robustly you can classify your data. But there are many other takeaways from this amazing result.
First, if there is a rule that perfectly classifies your data, and you run the algorithm for long enough, the perceptron has to find it eventually. The training rule only makes a finite number of mistakes, so the procedure has to terminate. Second, if you have streaming data but someone promised you that this data has margin M, then the average error on the stream goes to zero as more data streams through. Finally, assume that a random process generates the data. Run the Perceptron algorithm on a data set of size n sampled from this process until it stops making mistakes. Then Novikoff’s mistake bound promises that when you receive a new data point from the random process, the probability of making an error on the new sample is proportional to 1/n.
The first statement is about the convergence of an optimization algorithm. The second statement is about the regret incurred in sequential prediction. The third statement is a generalization bound. All three come for free with the perceptron.
Interpreting Novikoff’s Mistake Bound through these more sophisticated modern analyses highlights how impressive this perceptron is. I’ll show over the next few lectures that no model of linear classification ever gets better bounds on its prediction error. There is something fundamental in our theoretical understanding of optimization and geometry here. The only patterns we know we can recognize with linear models have some form of margin. The margin requirement seems to be true for nonlinear models as well. But what representation of data gives us patterns with large margin? I have no idea. And neither does anyone else.
In this sense, the perceptron analysis is post hoc justification. The algorithm will find a large-margin classifier if it’s there. Whether your problem has margin is a matter of art, engineering, and luck. Tomorrow, I’ll discuss how this drives everyone in machine learning insane.
I’m trying to write this colloquially. But if you want to be precise, assume all of the data points have Euclidean norm at most 1 and measure the required deviation in Euclidean distance. This is exactly how we define margin.
"Can linear rules become conscious of their existence?"
People want to know!