I off-handedly mentioned yesterday that I could write a whole blog about kernels. Jessica Dai (editor-in-chief of the wonderful magazine Kernel) took me to task and asked me to do it. She asked several questions, including:
> I remember making a big deal out of "the kernel trick," but also like, I don't think I ever really understood "so what?"
I realized this morning that I could write several blogs about kernels, but let me start here.
Yesterday, I described a bunch of ways to get tons of features by nonlinear rules. One question everyone always has in machine learning is: how many features do we need to make good predictions? From the perspective of getting the training error to zero, it always suffices to have exactly n features. I can set up n equations, forcing a linear combination of my features to equal a one-hot encoding of my labels. As long as the features I’ve built are linearly independent, I’ll be able to find a solution that gets my training error to zero.
Let me write this out in math. Each data point is a feature vector x. Stack the transposes of all of these vectors into a matrix X that is n x d. I want a linear predictor w. Let y denote the matrix of one-hot encodings of the class labels. We can focus on solving the system of equations Xw=y. As long as n is at most d and X has full row rank, we can find a solution to this system.
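Here's a minimal numpy sketch of that claim; the sizes and random features below are made up purely for illustration.

```python
# Toy check: with d >= n linearly independent features, a linear predictor
# can interpolate any labels. All sizes here are arbitrary.
import numpy as np

n, d, num_classes = 50, 200, 3          # more features than data points
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))         # n x d matrix of (random) features
labels = rng.integers(num_classes, size=n)
Y = np.eye(num_classes)[labels]         # n x k matrix of one-hot labels

# lstsq returns a solution of X W = Y; the system is underdetermined here,
# so it picks the minimum-norm one.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

train_preds = (X @ W).argmax(axis=1)
print("training error:", np.mean(train_preds != labels))   # 0.0
```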
But when d is bigger than n, there is an infinite set of w to choose from. Which one should you pick? The basic idea of the “kernel trick” is that you should pick a w that lies in the span of the xi. Why? If you run stochastic gradient descent or the perceptron, the update rule is always of the form

w ← w + a xi

where xi is the current data point and a is some scalar (a step size times a label or error term).
So if you initialize w at zero, this means the iterates always lie in the span of the data. Maybe another motivation would be that I can always write w = ws + wp, where ws is in the span of the data and wp is orthogonal to the data. On your data set,

Xw = Xws + Xwp = Xws

because Xwp = 0.
The training set gives us zero insights about the part of w outside of the span of the data. For this reason, there’s no reason to add any components outside the span unless we have strong, compelling outside information that these other components matter.
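If you want to see the span argument numerically, here is a toy sketch (arbitrary data, arbitrary number of updates): run perceptron updates from w = 0 and check that the final w is, up to rounding error, a combination of the rows of X.

```python
# Toy check that perceptron iterates started from zero stay in the span of the data.
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 100
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)      # binary labels for the perceptron

w = np.zeros(d)
for _ in range(200):
    i = rng.integers(n)
    if y[i] * (X[i] @ w) <= 0:           # on a mistake, add a multiple of the data point
        w += y[i] * X[i]

# Solve w ≈ X^T c; if w lies in the span of the data, the residual is ~0.
c, *_ = np.linalg.lstsq(X.T, w, rcond=None)
print("distance from the span of the data:", np.linalg.norm(X.T @ c - w))
```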
We can write that w is in the span of the data with the equation w = XTc for some coefficient vector c. If I combine this with Xw = y, I get y = XXTc. This is a system of n equations with n unknowns, and it has a unique solution as long as X has full rank. We went from having too many predictors to nailing down a unique predictor.
And we have also transformed the feature generation problem. The matrix K=XXT is called the kernel matrix. It is the matrix of inner products on the training data. Each entry is the inner product between a pair of data points, i.e., the cosine similarity up to normalization. This argument shows that the only things we care about in a feature representation are the cosine similarities this representation yields on our training data.
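To make this concrete, here is a small sketch of the kernelized fit: build K = XXT, solve Kc = y, and predict on a new point using only inner products with the training data. The data here is synthetic, just to show the mechanics.

```python
# Kernelized version of the interpolation fit: solve K C = Y with K = X X^T,
# then predict using inner products with the training points. Toy data again.
import numpy as np

rng = np.random.default_rng(2)
n, d, num_classes = 50, 200, 3
X = rng.standard_normal((n, d))
labels = rng.integers(num_classes, size=n)
Y = np.eye(num_classes)[labels]

K = X @ X.T                              # n x n kernel matrix of inner products
C = np.linalg.solve(K, Y)                # unique solution when K is invertible

# The predictor w = X^T C never needs to be formed explicitly:
x_new = rng.standard_normal(d)
scores = C.T @ (X @ x_new)               # equals w^T x_new
print("predicted class:", scores.argmax())
```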
Rather than materializing features, we can focus on functions that return cosine similarities. And we generate cosine similarities just like we generated nonlinear features. There’s a simple algebra that lets you construct almost anything. As an exercise to the reader, verify that if k1(x,z) and k2(x,z) are two kernel functions that represent actual inner products, then for any a,b>0,

a k1(x,z) + b k2(x,z) and k1(x,z) k2(x,z)
are also kernel functions. Adding kernels amounts to concatenating features. Multiplying kernels amounts to creating the pairwise products of all of the features. But note that we never had to materialize them. And if I follow this argument, this means that any convergent series

a0 + a1 k(x,z) + a2 k(x,z)^2 + a3 k(x,z)^3 + ...

with nonnegative coefficients aj
is a kernel function. It represents tons of combinations of features from the original space. This is how we end up with the wild world of too many kernels. If we invented this method today, the title would be “Cosine Similarity Is All You Need.” I’m glad we didn’t do this in the 1960s.
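If you'd rather check that algebra numerically than prove it, here is a toy sanity check that sums and elementwise products of valid kernel matrices stay positive semidefinite. The particular kernels and data below are arbitrary choices.

```python
# Sanity check of the kernel algebra: sums and (elementwise) products of
# valid kernel matrices remain positive semidefinite.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 10))

K1 = X @ X.T                             # linear kernel
K2 = (1.0 + X @ X.T) ** 3                # cubic polynomial kernel

a, b = 2.0, 0.5
K_sum = a * K1 + b * K2                  # adding kernels: concatenated (scaled) features
K_prod = K1 * K2                         # multiplying kernels: all pairwise products of features

for name, K in [("a*K1 + b*K2", K_sum), ("K1 * K2", K_prod)]:
    eigs = np.linalg.eigvalsh(K)
    print(name, "min eigenvalue:", eigs.min())   # nonnegative, up to numerical noise
```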
As I said, I could write a lot more about kernels in a lot more detail. Kernels have a long history in mathematics, dating back to the 1800s. They have been used in machine learning since the 1960s, when Aizerman proposed the method of potential functions, a kernelized extension of the perceptron. And they have applications far beyond machine learning. If you want more details, bug me in the comments. But I want to close here by answering one more of Jessica’s questions: “What are the applications kernels are suited for now?”
Kernel methods require building big matrices and solving big linear systems. These operations scale quadratically and cubically with dataset size, respectively. I have found that as long as you can efficiently solve the linear system exactly, kernel methods do well. I realize MNIST is a joke of a benchmark at this point, but I watched many talks before 2012 featuring people loudly bragging about their results on it (oh, you know who you are). This stupid Python script gets 1.13% error on MNIST by solving Kc=y with a polynomial kernel. However, if you gave me a problem with more than 100 thousand data points, I’d have to use another method.
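If you want the flavor of that experiment, here is a rough sketch of the approach, not the exact script: fit a polynomial-kernel least-squares classifier by solving Kc=y. The kernel degree, the tiny ridge term, and the subsampling below are arbitrary choices to keep the memory footprint modest, so the error rate will differ from 1.13%.

```python
# Sketch of a polynomial-kernel least-squares classifier on MNIST.
# Subsamples the training set so the kernel matrix fits comfortably in memory.
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target.astype(int)

n_train = 5000                                    # subsample for the sketch
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[60000:], y[60000:]             # OpenML's mnist_784 puts the 10k test set last

def poly_kernel(A, B, degree=4):
    # polynomial kernel: (1 + <a, b>)^degree
    return (1.0 + A @ B.T) ** degree

Y = np.eye(10)[y_train]                           # one-hot labels
K = poly_kernel(X_train, X_train)
K[np.diag_indices_from(K)] += 1e-6                # tiny ridge for numerical stability
C = np.linalg.solve(K, Y)

scores = poly_kernel(X_test, X_train) @ C
print("test error:", np.mean(scores.argmax(axis=1) != y_test))
```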
In contrast to what esteemed Turing Laureates say, this is generally true of nonlinear prediction. There’s no “right” way to do anything. You just want to find the approach that lets you train the biggest model fastest. The constraints of practice are then infrastructural and cultural rather than mathematical or metaphysical.
> The training set gives us zero insights about the part of w outside of the span of the data.
I'm confused about why this specifically is the case. Without further assumptions, the training set doesn't give us information about inputs which aren't included in the training set. Obviously we need to make some inductive assumptions. Which assumption leads to singling out linear combinations of the training data as special?
Super enjoying these writings even as I teach these concepts in class (and don't have much time to respond). I'm still hoping to respond someday to your first post on Shannon, LLMs, and the lack of need for probability in ML (I disagree, but that's for another day).
"You are nothing beyond your data" -- so, yeah, kernels clarify that any prediction is a linear combination of the high-dimensional version (kernelized) of the training data. That said, there is a phase transition computationally with these methods. In my lab, some of my students focused on data science for power systems use kernels methods (some modicum of "explainability") but are kernels even a thing these days given your previous blog on all things NNs? And where does NTK fit in all this? :)