Discussion about this post

Ishan Gaur:

Hi Ben, I have a question related to the kernel methods we discussed in class today. I found the connection between kernels and neural networks a bit puzzling. It seems that our claim about kernels' expressivity (with # params = # datapoints, as opposed to neural networks) is built on the assumption that v is orthogonal not only to our sampled data but also to the entire subspace spanned by all possible future lifted inputs from our data. Otherwise, we can't drop v when evaluating w^T Phi(x) without incurring an approximation error, right?
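To make the worry concrete, here's a minimal numpy sketch (my own toy example with a made-up explicit polynomial lift, nothing from the lecture): take w with a component in the span of the lifted training points plus a component v orthogonal to that span. Dropping v leaves every training prediction unchanged, but it generally changes the prediction at a new input unless v is also orthogonal to Phi(x_new).

```python
# Toy illustration (my own, hypothetical lift): dropping the orthogonal
# component v is harmless on the training points but not on new inputs.
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # explicit degree-3 polynomial lift of a scalar input (4-dimensional)
    return np.array([1.0, x, x ** 2, x ** 3])

X_train = rng.uniform(-1, 1, size=3)
Phi_train = np.stack([phi(x) for x in X_train])   # shape (3, 4)

alpha = rng.normal(size=3)
w_span = Phi_train.T @ alpha                      # lies in span{Phi(x_i)}

# projector onto span{Phi(x_i)}, then a component orthogonal to that span
P = Phi_train.T @ np.linalg.pinv(Phi_train.T)
v = (np.eye(4) - P) @ rng.normal(size=4)
w = w_span + v

x_new = 0.37
print(Phi_train @ w - Phi_train @ w_span)     # ~0: training predictions agree
print(phi(x_new) @ w - phi(x_new) @ w_span)   # generally nonzero at a new input
```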

This seems to suggest that the lift implied by our choice of kernel is a key part of controlling how much data is needed to fit the kernel method's coefficients and how much we might over- or under-fit. Does it make sense to think of the general-purpose kernels we saw today as dealing with this by:

1. Technically having an infinite basis, so that distinct lifted datapoints give you access to slightly different dimensions of the function basis and, taken together, span an expressive subspace with which you can define your decision boundary.

2. Also having rapid decay on the "higher resolution" basis terms, so that the approximate dimension of the lifted data space is not too big: hopefully no bigger than the number of datapoints.
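To sanity-check point 2 numerically, here's a rough sketch (again my own toy setup, not something from class): the Gram matrix of a Gaussian/RBF kernel on a couple hundred random points has eigenvalues that fall off quickly, so the lifted data effectively occupy a subspace of much lower dimension than n.

```python
# Toy illustration (my own): eigenvalue decay of an RBF Gram matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 200, 2, 1.0
X = rng.normal(size=(n, d))

# RBF kernel Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)[::-1]                   # sorted descending
effective_rank = int((eigvals > 1e-6 * eigvals[0]).sum())
print("top 10 eigenvalues:", np.round(eigvals[:10], 3))
print(f"effective rank ~ {effective_rank} out of n = {n}")
```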

Assuming that seems reasonable, I have two follow-up questions:

1. What's the "inductive bias" of the kernels we normally use? Does the data we train with end up living in a subspace of roughly constant dimensionality as we increase the number of datapoints? In the limit, does this tell us something about optimal compression of our data? (I sketch a small numerical experiment for this right after these two questions.)

2. Do you think there might be something about neural networks that deals with this tradeoff well? If kernels are really at the heart of any non-linear function approximation, then the huge number of parameters in a network must somehow be useful for finding a lifting where the dimensionality of the data ends up being small (at least smaller than the number of training datapoints, when it works), right?
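For question 1, one small experiment I could imagine running (purely my own sketch, with a hypothetical effective_rank cutoff): draw more and more points from a fixed distribution and watch how the effective rank of the RBF Gram matrix grows with n, to see whether it flattens out or keeps climbing.

```python
# Toy experiment (my own): effective rank of an RBF Gram matrix as n grows.
import numpy as np

rng = np.random.default_rng(0)

def rbf_gram(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def effective_rank(K, tol=1e-6):
    ev = np.linalg.eigvalsh(K)[::-1]
    return int((ev > tol * ev[0]).sum())

for n in [50, 100, 200, 400, 800]:
    X = rng.normal(size=(n, 2))      # 2-d data from a fixed distribution
    print(n, effective_rank(rbf_gram(X)))
```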

Joe Jordan:

I have read enough papers with the title "neural networks are universal function approximators" to have assimilated that information, but I guess I had never thought about the shape of the function before. I had just conceptualized NNs as a mapping from some wonky high-dimensional space to (and from) some other wonky space, possibly also high-dimensional. What could we hope to learn from the shape of the function? We are already just fitting the parameters of the function. Is the question you are asking more like "what bounds could we set on the number of parameters in the function"?
