There actually is some interesting work on optimizing the weights at init. Essentially that's transfer learning in a bunch of setups. I was trying to do something like this and had trouble getting it to work; eventually these guys figured out one way to do it: https://openreview.net/pdf/304a7da98b79c8c15e8baa9f038951976a9ef764.pdf
My take is that the developmental program does a lot to optimize the init of our biological networks.
LOL @ No. 9. I still remember this tweet - https://mobile.x.com/beenwrekt/status/913418862191710208 - and now that "optimally" is in scare quotes in the blog description, we can probably welcome No. 9 as the sneaky way of doing what you suggested in the footnote. (Early neural networks did in fact directly tune the weights, and we have algorithms like NEAT and its cousins.)
"Learning theory almost never says anything about data."
I have done an experiment where I observed that the same neural network, trained on MNIST and a slightly modified version of it, can attain wildly different generalization performance (the difference is 89%), even though both datasets have a functional relationship between the input and the output. This shows that data is crucial for predicting generalization, and no bound that does not use the properties of the data can predict the generalization performance of artificial neural networks.
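A rough sketch of the kind of experiment described above (the commenter's exact modification is not specified; here every label is replaced by a fixed, deterministic function of the image's pixels, so an input-output function still exists but the learnable structure is gone):

```python
# A rough sketch, not the commenter's exact setup: train the same small
# network on MNIST and on a "modified MNIST" in which every label is
# replaced by a fixed, deterministic function of the image's pixels.
# Both datasets have a functional input->output relationship, but test
# accuracy on the relabeled version collapses to roughly chance level.
import hashlib
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def hashed_label(img):
    # Deterministic pseudo-random label that depends only on the pixels.
    return hashlib.sha1(img.numpy().tobytes()).digest()[0] % 10

class HashedMNIST(datasets.MNIST):
    def __getitem__(self, idx):
        img, _ = super().__getitem__(idx)
        return img, hashed_label(img)

def train_and_eval(dataset_cls):
    tfm = transforms.ToTensor()
    train = dataset_cls("data", train=True, download=True, transform=tfm)
    test = dataset_cls("data", train=False, download=True, transform=tfm)
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256),
                          nn.ReLU(), nn.Linear(256, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(5):  # a few epochs is enough to see the gap
        for x, y in DataLoader(train, batch_size=128, shuffle=True):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item()
                      for x, y in DataLoader(test, batch_size=512))
    return correct / len(test)

print("plain MNIST test accuracy:    ", train_and_eval(datasets.MNIST))
print("relabeled MNIST test accuracy:", train_and_eval(HashedMNIST))
```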
Statistical learning theory ultimately tells us something about the gap R(fhat) - R(f*), where R is the expected loss on a new example, fhat is our trained model, and f* is the best model within the model class we are considering (e.g., a particular neural net architecture). This is often called the estimation error or statistical error. The approximation error is another quantity of interest, R(f*) - R(f**), where f** is the best possible model (usually not within our model class). I wonder what your take is on the approximation error vs. estimation error tradeoff.
I wrote about it here: https://www.argmin.net/p/the-adaptivity-paradox
This is not a tradeoff. It is a decomposition into terms where we add and subtract the same thing for syntactic sugar.
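Writing out the add-and-subtract in the notation of the question above (\hat f for fhat, f^* for the best in-class model, f^{**} for the best possible model):

```latex
% The excess risk decomposes as an identity, obtained by adding and
% subtracting R(f^*); the algebra itself asserts no tradeoff.
\[
\underbrace{R(\hat f) - R(f^{**})}_{\text{excess risk}}
  \;=\;
\underbrace{R(\hat f) - R(f^{*})}_{\text{estimation error}}
  \;+\;
\underbrace{R(f^{*}) - R(f^{**})}_{\text{approximation error}}
\]
```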
I don't think the decomposition is helpful. The only term in that decomposition we can study rigorously is the estimation error, and everything we infer from overanalyzing estimation error is wrong.
And what can we actually say about approximation error? Usually we get some mumbo jumbo about the curse of dimensionality that tells us all high dimensional functions have to be constant. Are there any interesting insights from approximation theory in machine learning?
I agree that we can't say much about the approximation error term without assumptions, but there is sometimes a tradeoff (e.g., in certain regression settings), and in certain settings we can use physical knowledge to help us understand the approximation error / model class question.
Commented briefly about it here: https://www.argmin.net/p/overfitting-to-theories-of-overfitting/comment/93644128
So, for example, everything you taught us in Statistical Learning Theory is correct, except that the upper bounds for estimation error, while correct, are pessimistic (it is an upper bound, after all), and the lower bounds pretend that the test set doesn't exist!
I wonder if there are examples of an arbitrarily complex model class that must "overfit" when using the holdout method to fit.
FWIW, Steve Smale, Felipe Cucker, and Ding Xuan Zhou did attempt to bring insights from approximation theory into machine learning.
Sure, and they weren't the only ones! Lots of people did this. What did we learn beyond the wrong lesson: "there's a curse of dimensionality and so you better regularize the fuck out of everything?"
epsilon = O(n^{-s/d})
is not a good takeaway!
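For context, that is the classical approximation rate for s-smooth functions in d dimensions; inverting it is what makes the curse-of-dimensionality reading explicit:

```latex
% Inverting \epsilon = O(n^{-s/d}): to reach accuracy \epsilon for an
% s-smooth target in d dimensions, the classical theory asks for
\[
  n \;=\; \Omega\!\left(\epsilon^{-d/s}\right)
\]
% samples or parameters, i.e. exponential in the dimension d, which is
% the "regularize everything" lesson being objected to above.
```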
Approximation error is a red herring anyway, both from the Popperian and from the Humean point of view. Although Andrew Barron's neural net approximation result is valuable conceptually -- instead of asking for rates for approximating any continuous function (which are bad!), let's ask ourselves which functions are efficiently approximable by neural nets. I gave a talk about this at Yale last April: https://youtu.be/8myodaNxhCA?si=48LguAkL2Uas5ZLi
Also, Misha: https://arxiv.org/abs/1801.03437
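For readers who haven't seen it, the flavor of the Barron result mentioned above (stated loosely here, constants and regularity conditions omitted) is a dimension-free rate for functions whose Fourier transform has a finite first moment:

```latex
% Loose statement of Barron (1993): if
%   C_f = \int_{\mathbb{R}^d} \|\omega\| \, |\hat f(\omega)| \, d\omega < \infty,
% then for each n there is a one-hidden-layer sigmoidal network f_n with
% n units such that, over a bounded domain B with probability measure \mu,
\[
  \int_{B} \bigl( f(x) - f_n(x) \bigr)^2 \, d\mu(x)
  \;\le\; \frac{c \, C_f^2}{n},
\]
% where c depends on the radius of B but not on the dimension d.
```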
I would love a slightly more technical exposition (like lecture notes level?) on the learning theory points. Bonus if you could highlight any theory that actually does account for features in explaining generalization.
To your point about theory not taking data into account: so much of statistics and learning theory for simple-ish algorithms has no answer for correlated data with the magnitudes of correlation seen in the real world...
In some ways, it has long been conventional wisdom to train on the test set. For instance, if you did a train-test split, chose hyper-parameters with the test set, and were going to deploy the model, it would make sense to retrain (using those hyper-parameters) on both test and train. The more data it is trained on, the better it should generalize.
More concretely, when teaching k-fold cross-validation (which is useful for small data), students would always ask, "so then which of the k models should we deploy?" The reasonable answers are either (a) retrain on all of the data with the learned hyper-parameters, or (b) somehow average the models (e.g., boosting).
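A minimal sketch of option (a) with scikit-learn, whose GridSearchCV already implements it (the dataset and parameter grid below are placeholders):

```python
# Option (a): pick hyper-parameters by k-fold cross-validation, then
# retrain a single model on all of the data with those settings.
# GridSearchCV does exactly this when refit=True (the default).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 1e-3]},
    cv=5,        # 5-fold cross-validation for hyper-parameter selection
    refit=True,  # refit the best configuration on all of X, y
)
search.fit(X, y)

# search.best_estimator_ is the single model you would deploy.
print(search.best_params_, round(search.best_score_, 3))
```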
The point of the test set (IMO) is to get a "clean" evaluation of how well your model is doing. If you mix how you evaluate with how you train, then this signal gets muddled. Of course, it's likely still correlated (as your plot shows), but we just don't know how to correct for that muddling yet, or at least not in all cases, and it may vary based on unknown properties of the data.
> First, hyperparameters are just vibes.
No.
> But if you let me run RL on the neural network weights, I can get the test error to zero. Why is it that if we optimize everything else, we don’t get the error to zero?
This is the heart of it. This question.
That was a teaser question. As I discuss later in the post, we do in fact get the test error to zero if we're allowed to burn enough computing cycles.
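A toy sketch (not the post's actual experiment) of what "optimizing the weights against the test set" looks like: zeroth-order random hill-climbing on the output weights of a small random-features classifier, with measured test accuracy itself as the objective. The data, model, and step size are made up for illustration.

```python
# Zeroth-order random hill-climbing on model weights, scored directly on
# the "test set". Purely illustrative: data, model, and step size are
# placeholders, not anything from the post.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "test set": 50 points, 20 features, random binary labels.
X_test = rng.normal(size=(50, 20))
y_test = rng.integers(0, 2, size=50)

# Fixed random ReLU features; the search is over the 200 output weights.
W_feat = rng.normal(size=(20, 200))
Phi = np.maximum(X_test @ W_feat, 0.0)

def test_accuracy(w_out):
    return np.mean((Phi @ w_out > 0).astype(int) == y_test)

w = rng.normal(size=200)
best = test_accuracy(w)
for _ in range(100_000):
    candidate = w + 0.05 * rng.normal(size=200)
    acc = test_accuracy(candidate)
    if acc >= best:  # greedily accept anything at least as good
        w, best = candidate, acc
    if best == 1.0:
        break

# On a test set this small, the search typically drives measured "test"
# accuracy to 1.0, which says nothing about performance on fresh data.
print(f"measured test accuracy after search: {best:.3f}")
```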
Repeated test-set taking and optimization is what biological evolution is about. The objective function is reproductive success [of the genes]. Organisms have no interest in "the data" or human reasoning, just brute-force Darwinian selection of genes in the population.
Isn't cultural evolution similar? It isn't formal learning, just seeing what works.
ML that uses genetic algorithms (GAs), or other techniques that have access to data labels, is a shortcut to finding the best features. Unfortunately, while simpler rule-based models like decision trees do allow inspection, ANNs have largely hidden that information from inspection, making them as opaque as GAs. [The interaction of genes is not unlike the interaction of ANN neurons and weighted connections, although it is less structured, more like a reservoir computer.]
The adaptive nature of gene evolution has proven very robust over four billion years, creating complex interacting networks between species that adapt to, and recover from, changing environments (both physical and biological).