Hi Ben, I have a question related to the kernel methods we discussed in class today. I found the connection between kernels and neural networks a bit puzzling. It seems that our claim of kernels' expressivity (with # params = # datapoints, as opposed to neural networks) is built on the assumption that v (the component of w orthogonal to the span of the lifted training points) is orthogonal not only to our sampled data but also to the entire subspace spanned by all possible future lifted inputs from our data. Otherwise, we can't drop v when evaluating w^T Phi(x) without incurring an approximation error, right?
This seems to suggest that the lift implied by our choice of kernel is a key part of controlling how much data is needed to fit the kernel method's coefficients and how much we might over- or under-fit. Does it make sense to think of the general-purpose kernels we saw today as dealing with this by:
1. Technically having an infinite basis, so that the non-identical lifted datapoints give you access to slightly different dimensions of the function basis and, taken together, span an expressive subspace with which you can define your decision boundary.
2. Also having rapid decay on the "higher-resolution" basis terms, so that the approximate dimension of the lifted data space is not too big: hopefully no bigger than the number of data points (a quick numerical check of what I mean follows below).
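To make point 2 concrete, here is a toy sanity check of my own (not something from class), assuming an RBF kernel and scikit-learn's KernelRidge: the Gram matrix eigenvalues decay quickly, and the fitted model has exactly one coefficient per datapoint.

```python
# Toy check (my own example): the RBF Gram matrix on n points has at most n
# nonzero eigenvalues, and they decay fast, so the "effective" dimension of
# the lifted data is typically far smaller than n even though the feature
# space is infinite-dimensional.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))            # n datapoints in 2D
y = np.sin(3 * X[:, 0]) * np.cos(3 * X[:, 1])  # some smooth target

K = rbf_kernel(X, X, gamma=1.0)                # Gram matrix K_ij = k(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)[::-1]          # eigenvalues, largest first
eff_dim = int(np.sum(eigvals > 1e-6 * eigvals[0]))
print(f"n = {n}, eigenvalues above tolerance: {eff_dim}")  # typically far fewer than n

model = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(X, y)
print("number of fitted coefficients:", model.dual_coef_.shape[0])  # exactly n
```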
Assuming that seems reasonable, I have two follow-up questions:
1. What's the "inductive bias" of the kernels we normally use? Does the data we train with end up living in a subspace of roughly constant dimensionality as we increase the number of datapoints? In the limit, does this tell us something about optimal compression of our data?
2. Do you think there might be something about neural networks that deals with this tradeoff well? If kernels are really at the heart of any non-linear function approximation, then the huge number of parameters in a network must somehow be useful for finding a lifting where the dimensionality of the data ends up being small (at least smaller than the number of training points, when it works), right?
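Here is the kind of quick experiment I have in mind for question 1 (my own toy sketch; the kernel, bandwidth, and tolerance are arbitrary choices on my part):

```python
# What I mean by question 1 (toy experiment): track the "effective dimension"
# of the RBF-lifted data as n grows. If it grows much more slowly than n,
# the kernel is effectively compressing the data.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
for n in [50, 100, 200, 400, 800]:
    X = rng.uniform(-1, 1, size=(n, 2))
    K = rbf_kernel(X, X, gamma=1.0)
    eigvals = np.linalg.eigvalsh(K)[::-1]
    eff_dim = int(np.sum(eigvals > 1e-6 * eigvals[0]))
    print(f"n = {n:4d}  effective dimension = {eff_dim}")
```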
One of the peculiar facts about kernels is that even if I tell you in advance what the test set is, your optimal answer is going to lie in the span of the training set. It's a weird fact that is related to something called "The Representer Theorem." The only data that matters is the data where you constrain the labels.
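In symbols, for the usual norm-regularized formulation (any strictly increasing penalty on the RKHS norm gives the same conclusion):

```latex
\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \lambda \lVert f \rVert_{\mathcal{H}}^{2}
\quad\Longrightarrow\quad
f^{\star}(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x)
```

Note that the expansion runs only over the training points x_i, no matter what other inputs you know about.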
To answer the follow-up questions, I have a clarification question: What is "inductive bias?"
Thanks for mentioning the Representer Theorem, Ben! It was really thought-provoking to learn about, and I understand Tuesday's lecture way better now. It seems like the phenomenon you described (access to test inputs not changing the selected function) is a property both of how we've defined risk and of the class of functions (RKHS) we've selected. I have two follow-up questions:
1. Can unlabeled data enter the picture painted by the representer theorem through how I select a risk function? For instance, I can reweight the errors for each labeled point according to how close it is to the test points without changing the fact that the risk function only needs to be measured on the training data. Similarly, I can't break this by adding something like a smoothness assumption on the test data outputs: if I do that, those test points acquire some feasible set of labels, and only once they are effectively labeled data can my risk function select components of the function space outside the span of the original training set.
2. Relating to what I meant by inductive biases earlier: my guess is that most kernels we discuss in class are fully general in terms of the functions they can express. However, the space of functions spanned by my data might be different in different RKHSs. Are there intuitive ways to think about this for different kernel+dataset combinations? Is choosing/constructing the right kernel often a bottleneck with kernel methods? I'm assuming that once you have a good one, optimizing the coefficients is straightforward for something like an MSE objective. (A toy comparison of what I mean is below.)
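Here is the kind of comparison I have in mind (my own sketch with scikit-learn; the kernels and hyperparameters are arbitrary illustrations, not anything from class):

```python
# Toy sketch of "different kernels span different function spaces over the
# same data": fit kernel ridge regression with several kernels on the same
# 1D dataset and compare how well each can reach the target.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=(40, 1)), axis=0)
y = np.sign(np.sin(8 * np.pi * X[:, 0]))     # a "jagged" target

for kernel, params in [("rbf", {"gamma": 10.0}),
                       ("rbf", {"gamma": 1000.0}),
                       ("polynomial", {"degree": 3}),
                       ("laplacian", {"gamma": 10.0})]:
    model = KernelRidge(kernel=kernel, alpha=1e-2, **params).fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    print(f"{kernel:10s} {params}  train MSE = {train_mse:.3f}")
# Same data, same objective, very different fits: the kernel choice (the
# inductive bias) decides which functions the training points can "reach".
```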
I have read enough papers with the title "neural networks are universal function approximators" to have assimilated that information, but I guess I had never thought about the shape of the function before. I have just conceptualized NNs as a mapping from some wonky high-dimensional space to (and from) some other wonky space, possibly also high-dimensional. What could we hope to learn from the shape of the function? We are already just fitting the parameters of the function. Is the question you are asking more like "what bounds could we set on the number of parameters in the function"?
The ideal would be something like the Shannon-Nyquist theorem, which tells you exactly how much data (and, by extension, how much computation and how many model parameters) you need to approximate a signal with a particular frequency content.
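Concretely, for a signal band-limited to maximum frequency B, samples taken at rate f_s > 2B determine it exactly:

```latex
x(t) = \sum_{n=-\infty}^{\infty} x(nT)\,\operatorname{sinc}\!\left(\frac{t - nT}{T}\right),
\qquad T = \tfrac{1}{f_s},\quad f_s > 2B
```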
This is asking for far too much in machine learning, but some guidelines would be nice, no? It's a tough state of affairs when the advice is now basically "pick as many parameters as you can based on your GPU memory size and time to your deadline."
We have some of this information. Recent work on LLM quantization suggests that each model parameter effectively carries only a few bits of information (which is why FP4 quantization works surprisingly well). Connect this to the insight from the Chinchilla scaling paper that it takes roughly 20 training tokens per model parameter to train an LLM compute-optimally, and you are starting to get somewhere. But then the question becomes "why is the information content of the model parameters so low?" It would be nice to have some theory about how you could go about changing a model architecture to allow more information per parameter to be learned.
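Rough back-of-envelope with Chinchilla's own numbers (70B parameters, ~1.4T training tokens); the bits-per-parameter figure is an assumed order of magnitude, not a settled measurement:

```python
# Rough back-of-envelope, not a careful calculation. The Chinchilla model had
# ~70B parameters trained on ~1.4T tokens (about 20 tokens per parameter).
# Assume (this is the shaky part) each parameter ends up storing only a
# couple of bits of usable information.
params = 70e9
tokens_per_param = 20
bits_per_param = 2            # assumed capacity, order-of-magnitude only

train_tokens = params * tokens_per_param
stored_bits = params * bits_per_param
raw_fp16_bits = params * 16

print(f"training tokens:        {train_tokens:.2e}")                 # ~1.4e12
print(f"'stored' information:   {stored_bits / 8 / 1e9:.1f} GB")     # ~17.5 GB
print(f"FP16 weight file size:  {raw_fp16_bits / 8 / 1e9:.1f} GB")   # ~140 GB
# The gap between the last two lines is the "why is the information content
# so low?" question expressed as a single number.
```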