> The training set gives us zero insights about the part of w outside of the span of the data.

I'm confused about why this specifically is the case. Without further assumptions, the training set doesn't give us information about inputs which aren't included in the training set. Obviously we need to make some inductive assumptions. Which assumption leads to singling out linear combinations of the training data as special?

What is an inductive assumption? People use the word "inductive" differently, so I want to make sure I know exactly what you mean.

Something that lets us escape no free lunch theorems (or Hume) and imagine that we can infer anything at all from the training data.

Hmm, I'm not convinced we can escape Hume other than through pragmatism.

But that said, I like your definition because it is straightforward and honest. So if you pick a feature representation, that's an inductive assumption. If you assume that you want a linear combination of these features, that's another inductive assumption. Without any other inductive assumptions, this constrains you to searching only over functions in the span of the data.
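One way to spell out that last step (a standard least-squares argument, sketched here with generic notation, not tied to any particular loss): decompose the weight vector against the span of the training inputs,

```latex
w = w_{\parallel} + w_{\perp}, \qquad
w_{\parallel} \in \operatorname{span}\{x_1, \dots, x_n\}, \qquad
x_i^{\top} w_{\perp} = 0 \ \text{for all } i .
```

Every training prediction satisfies \(x_i^{\top} w = x_i^{\top} w_{\parallel}\), so the training loss is identical for every choice of \(w_{\perp}\): the data cannot pin it down. Setting \(w_{\perp} = 0\) (e.g. by taking the minimum-norm solution) is precisely the additional inductive assumption.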

Hmm, I think I'm missing the last step. I'll have to re-read to make sure I'm understanding it right.

Super enjoying these writings even as I teach these concepts in class (and don't have much time to respond). I'm still hoping to respond someday to your first post on Shannon, LLMs, and the lack of need for probability in ML (I disagree, but that's for another day).

"You are nothing beyond your data" -- so, yeah, kernels clarify that any prediction is a linear combination of the high-dimensional (kernelized) versions of the training data. That said, there is a computational phase transition with these methods. In my lab, some of my students focused on data science for power systems use kernel methods (some modicum of "explainability"), but are kernels even a thing these days, given your previous blog on all things NNs? And where does NTK fit into all this? :)

There are some interesting ideas that have come out of NTK research, and I think many of these are useful guides for practice. I'll talk about a few of these tomorrow.

What is this phase transition you refer to?

I shouldn't really call it that, as we all know the complexity of these operations grows as O(N^3). So for the grid, while such predictors are solid for small- to moderate-sized networks/telemetry data, they become computationally extremely slow for larger data sizes. From this network-size perspective, it feels like a "phase transition," but it is merely a nonlinear computational cost that catches up with you.

Yes, agreed. I'm not even sure the n^3 scaling is directly the issue. Kernel methods are O(n^3), but iterative solvers get you closer to O(n^2). In neural nets, you let the size of the model grow with the size of the data set. So every inference is O(n), and then if you do a pass over the data, that's O(n^2).
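To make the direct-vs-iterative contrast concrete, here is a minimal sketch of kernel ridge regression solved both ways (the kernel, bandwidth, and sizes are illustrative choices, not anything from the post):

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-10, maxiter=2000):
    # Plain conjugate gradient for symmetric positive-definite systems.
    # Each iteration costs one matrix-vector product, i.e. O(n^2) for a
    # dense n-by-n kernel matrix; well-conditioned problems need far
    # fewer than n iterations.
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n, d = 300, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# RBF kernel matrix: O(n^2 d) to form, O(n^2) to store.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2.0 * d))
lam = 1.0

# Direct solve of (K + lam*I) alpha = y: O(n^3).
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)

# Iterative solve: O(n^2) per iteration.
alpha_cg = cg_solve(lambda v: K @ v + lam * v, y)

err = np.max(np.abs(alpha_direct - alpha_cg))
print(err)  # agrees to high precision
```

Both solvers produce the same coefficients; the difference is purely in how the cost scales with n, which is the point being made above.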

The constants here matter: which method maps better onto hardware, which is easier to implement in software, etc. These factors start to dominate.

As I say in this post, so many of the reasons why methods "win" are related to exogenous factors.

At one point, Sasha Rakhlin and I were thinking of translating the book by Aizerman, Braverman, and Rozonoer into English, with detailed annotation. I may do this yet, just need to wait until my next sabbatical.

Dudes. You have to do this. A good translation of the 1974 Vapnik-Chervonenkis book, too.

Next time we hang out and have cocktails, I'll tell you how I was drinking vodka and talking about Hume and machine learning with Rozonoer in 2012.

I may be missing something, but wouldn't XX^T be cosine similarity only if the features are normalized? That's presumably not always the case.

As for one place where kernels are still very useful: learning fair representations of data. Here you typically have conflicting objectives, and neural networks trained with gradient descent have a hard time finding good solutions, or at least don't do so reliably. With kernels you can come up with a closed-form solution under mild assumptions, which is very stable and gives you the global optimum. And you can also scale this to at least 200k samples by using your random Fourier features idea.

Yes, I was being too fast and loose here. It's cosine similarity when the data is normalized. I don't know of any examples where you can't get good performance with normalized data. Do you have one in mind?
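The normalization point in this exchange can be checked in a few lines: the Gram matrix XX^T holds raw inner products, and it coincides with cosine similarity exactly when each row is scaled to unit length (the data below is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))

# Cosine similarity between rows: inner product divided by both norms.
norms = np.linalg.norm(X, axis=1, keepdims=True)
cos = (X @ X.T) / (norms * norms.T)

# Without normalization, XX^T is just the matrix of inner products,
# and its diagonal is ||x_i||^2 rather than 1.
G = X @ X.T

# After normalizing each row to unit length, XX^T *is* cosine similarity.
Xn = X / norms
print(np.allclose(G, cos), np.allclose(Xn @ Xn.T, cos))  # False True
```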

And yes, I agree that closed-form and predictable solutions are one of the more attractive selling points for kernel methods. I also like that you bring up that we should care about more than test error! After next week, that is all I am going to talk about for the remainder of the semester.