Just as many of the algorithms and community practices of machine learning were invented in the late 1950s and early 1960s, the foundations of machine learning theory were also established during this time. Many of the analyses of this period were strikingly simple, had surprisingly precise constants, and provided prescient guidelines for contemporary machine learning practice. Here, I’ll summarize the study of the Perceptron, highlighting both its algorithmic and statistical analyses, and using it as a prototype to illustrate further how prediction deviates from the umbrella of classical statistics.

Let’s begin with a classification problem where each individual from some population has a feature vector $x$ and an associated binary label $y$ that we take as valued $\pm 1$ for notational convenience. The goal of the Perceptron is to find a linear separator such that $\langle w, x \rangle>0$ for when $y=1$ and $\langle w, x \rangle<0$ when $y=-1$. We can write this compactly as saying that we want to find a $w$ for which $y \langle w, x \rangle >0$ for as many individuals in the population as possible.

Rosenblatt’s Perceptron provides a simple algorithm for finding such a $w$. The Perceptron inputs an example, checks if it makes the correct classification. If yes, it does nothing and proceeds to the next example. If no, the decision boundary is nudged in the direction of classifying the example correctly next time.

Perceptron

• Start from the initial solution $w_0=0$
• At each step $t=0,1,2,…$:
• Select an individual from the population and look up their attributes: (x_t,y_t).
• Case 1: If $y_t\langle w_t, x_t\rangle \leq 0$, put $$w_{t+1} = w_t + y_t x_t$$
• Case 2: Otherwise put $w_{t+1} = w_t$.

If the examples were selected at random, machine learners would recognize this algorithm as an instance of stochastic gradient descent, still the most ubiquitous way to train classifiers whether they be deep or shallow. Stochastic gradient descent minimizes sums of functions

$f(w) = \frac{1}{N} \sum_{i=1}^N \mathit{loss}( f(x_i; w) , y_i)$

with the update

$w_{t+1} = w_t - \alpha_t \nabla_w \mathit{loss}( f(x_t; w_t) , y_t)\,.$

When the examples are sampled randomly, the Perceptron is stochastic gradient descent with $\alpha_t=1$, $f(x;w) = \langle w,x \rangle$, and loss function $\mathit{loss}(\hat{y},y) = \max(-\hat{y} y, 0)$.

Stochastic gradient methods were invented a few years before the Perceptron. And the relations between these methods were noted by the mid-60s. Vapnik discusses some of this history in Chapter 1.11 of The Nature of Statistical Learning Theory.

While we might be tempted to use a standard stochastic gradient analysis to understand the optimization properties of the Perceptron, it turns out that a more rarified proof technique applies that uses no randomization whatsoever. Moreover, the argument will not only bound errors in optimization but also in generalization. Optimization is concerned with errors on a training data set. Generalization is concerned with errors on data we haven’t seen. The analysis from the 1960s links these two by first understanding the dynamics of the algorithm.

A celebrated result by Al Novikoff in 1962 showed that under reasonable conditions the algorithm makes a bounded number of updates no matter how large the sample size. Novikoff’s result is typically referred to as a mistake bound as it bounds the number of total misclassifications made when running the Perceptron on some data set. The key assumption in Novikoff’s argument is that the positive and negative examples are cleanly separated by a linear function. People often dismiss the Perceptron because of this separability assumption. But for any finite data set, can always add features and end up with a linearly separable problem. And if we add enough features, we’ll usually be separable no matter how many points we have.

This has been the trend in modern machine learning: don’t fear big models and don’t fear getting zero errors on your training set. This is no different than what was being proposed in the Perceptron. In fact, Aizerman, Braverman, and Roeznoer recognized the power of such overparameterization, and extended Novikoff’s argument to “potential functions” that we now recognize as functions belonging to an infinite dimensional Reproducing Kernel Hilbert Space.

To state Novikoff’s result, we make the following assumptions: First, we assume as input a set of examples $S$. We assume every data point has norm at most $R(S)$ and that there exists a hyperplane that correctly classifies all of the data points and is of distance at least $\gamma(S)$ from every data point. This second assumption is called a margin condition that quantifies how separated the given data is. With these assumptions, Novikoff proved the Perceptron algorithm makes at most

${\small \frac{R(S)^2}{\gamma(S)^{2}} }$

mistakes when run on $S$. No matter what the ordering of the data points in $S$, the algorithm makes a bounded number of errors.

The algorithmic analysis of Novikoff has many implications. First, if the data is separable, we can conclude that the Perceptron will terminate if it is run over the data set several times. This is because we can think of $k$ epochs of the Perceptron as running on the union of $k$ distinct copies of $S$, and the Perceptron eventually stops updating when run on this enlarged data set. Hence, the mistake bound tells us something particular about optimization: the Perceptron converges to a solution with zero training errors and hence a global minimizer of the empirical risk.

Second, we can think of the Perceptron algorithm as an online learning algorithm. We need not assume anything distributional about the sequence $S$. We can instead think about how long it takes for the Perceptron to converge to a solution that would have been as good as the optimal classifier. We can quantify this convergence by measuring the regret, equal to

$\mathcal{R}_T = \sum_{t=1}^T \mathrm{error}(w_t, (x_t,y_t)) - \sum_{t=1}^T \mathrm{error}(w_\star, (x_t,y_t))\,,$

where $w_\star$ denotes the optimal hyperplane. That is, the regret counts how frequently the classifier at step $t$ misclassifies the next example in the sequence. Novikoff’s argument shows that, if a sequence is perfectly classifiable, then the accrued regret is a constant that does not scale with T.

A third, less well known application of Novikoff’s bound is as a building block for a generalization bound. A generalization bound estimates the probability of making an error on a new example given that the new example is sampled from the same population as the data thus far sceen. To state the generalization bound for the Perceptron, I now need to return to statistics. Generalization theory concerns statistical validity, and hence we need to define some notion of sampling from the population. I will use the same sampling model I have been using in this blog series. Rather than assuming a statistical model of the population, I will assume we have some population of data from which we can uniformly sample. Our training data will consist of $n$ points sampled uniformly from this population: $S={(x_1,y_1)\ldots, (x_n,y_n) }$.

We know that the Perceptron will find a good linear predictor for the training data if it exists. What we now show is that this predictor also works on new data sampled uniformly from the same population.

To analyze what happens on new data, I will employ an elegant argument I learned from Sasha Rakhlin. This argument appears in a book on Learning Theory by Vapnik and Chervonenkis from 1974, which, to my knowledge, is only available in Russian. Sasha also believes this argument is considerably older as Aizermann and company were making similar “online to batch” constructions in the 1960s. The proof here leverages the assumption that the data are sampled in such a way that they are identically distributed, so we can swap the roles of training and test examples in the analysis. It foreshadows later studies of stability and generalization that would be revisited decades later.

Theorem Let $w(S)$ be the output of the Perceptron on a dataset $S$ after running until the hyperplane makes no more mistakes on $S$. Let $S_n$ denote a training set of $n$ samples uniformly at random from some population. And let $(x,y)$ be an additional independent uniform sample from the same population. Then, the probability of making a mistake on $(x,y)$ is bounded as

$\Pr[y \langle w(S_n), x \rangle \leq 0] \leq \frac{1}{n+1} {\mathbb{E}}_{S_{n+1}}\left[ \frac{R(S_{n+1})^2}{\gamma(S_{n+1})^2} \right]\,.$

To prove the theorem, define the “leave-one-out set” to be the set where we drop $(x_k,y_k)$:

${\scriptsize S^{-k}=\{(x_1,y_1),\dots,(x_{k-1},y_{k-1}),(x_{k+1},y_{k+1}),...,(x_{n+1},y_{n+1})\}\,. }$

With this notation, since all of the data are sampled identically and independently, we can rewrite the probability of a mistake on the final data point as the expectation of the leave-one-out error

${\small \Pr[y \langle w(S_n), x \rangle \leq 0] = \frac1{n+1}\sum_{k=1}^{n+1} \mathbb{E}[\mathbb{1}\{y_k \langle w(S^{-k}), x_k \rangle \leq 0\}]\,. }$

Novikoff’s mistake bound asserts the Perceptron makes at most

${\small m=\frac{R(S_{n+1})^2}{\gamma(S_{n+1})^2} }$

mistakes when run on the entire sequence $S_{n+1}$. Let $I={i_1,\dots,i_m}$ denote the indices on which the algorithm makes a mistake in any of its cycles over the data. If $k$ is not in $I$, the output of the algorithm remains the same after we remove the $k$-th sample from the sequence. It follows that such $k \in S_{n+1}\setminus I$ satisfy $y_k w(S^{-k})x_k \geq 0$ and therefore do not contribute to the right hand side of the summation. The other terms can at most contribute $1$ to the summation. Hence,

$\Pr[y \langle w(S_n), x \rangle \leq 0] \le \frac{\mathbb{E}[m]}{n+1}\,,$

which is what we wanted to prove.

What’s most stunning to me about this argument is that there are no numerical constants or logarithms. The generalization error is perfectly quantified by a simple formula of $R$, $\gamma$, and $n$. There are a variety of other arguments that get the $\tilde{O}(R/(n\gamma))$ scaling with far more complex arguments and large constants and logarithmic terms. For example, one can show that the set of hyperplanes in Euclidean space with norm bounded by $\gamma^{-1}$ has VC dimension $R/\gamma$. Similarly, a Rademacher complexity argument will achieve a similar scaling. These arguments apply to far more algorithms than the Perceptron, but it’s frustrating how this simple algorithm from 1956 gets such a tight bound with such a short argument whereas analyzing more “powerful” algorithms often takes pages of derivations.

It’s remarkable that these bounds on optimization, regret, and generalization worked out in the 1960s all turned out to be optimal for classification theory. This strikes me as particularly odd because when I was in graduate school I was taught that the Perceptron was a failed enterprise. But as fads in AI have come and gone, the role of the Perceptron has remained central for 65 years. We’ve made more progress in machine learning theory since then, but it’s not always at the front of our minds just how long ago we had established our modern learning theory framework.