Discussion about this post

Jeremy Cohen:

This is not a good conceptual model for neural net optimization. I am 100% positive that if you take anything approaching a real neural net and train it on anything approaching a real dataset using gradient descent at anything approaching a real learning rate, the optimization dynamics will quickly become non-contractive. Assuming square loss, the spectral norm of the NTK JJᵀ will grow until it reaches the specific value at which the map is no longer contractive, and will then stop growing. The dynamics during this phase of training will not be locally linear over any timescale, no matter how short.
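
Concretely, the standard linearization behind this claim looks as follows; note the 1/2-normalized, unsummed square loss is an assumed convention, since the comment does not fix one:

```latex
% Square loss on the residual r = f_\theta(X) - y:
%   L(\theta) = \tfrac{1}{2} \lVert r \rVert^2 ,
% so one gradient step \theta \leftarrow \theta - \eta J^\top r moves the
% residual, to first order in \eta, by
\[
  r_{t+1} \approx \bigl(I - \eta\, J J^\top\bigr)\, r_t ,
\]
% where J = \partial f_\theta(X) / \partial \theta is the Jacobian and
% K = J J^\top is the empirical NTK. For positive definite K, this map is
% a contraction exactly when
\[
  \eta \, \lambda_{\max}(K) < 2 ,
\]
% so \lambda_{\max}(J J^\top) can grow until it hits 2/\eta, at which
% point the dynamics stop being contractive.
```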

But you shouldn’t believe a word that I (or anyone else) says — you should run this experiment for yourself on your own net and data, so that you can see with your own eyes that what I am saying is true.
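
Here is one way to run that check, as a minimal sketch in JAX rather than anything from the original post: the Gaussian toy data, the tanh MLP, and the learning rate below are all hypothetical stand-ins for "your own net and data". The quantity to watch is lambda_max(JJᵀ) against the threshold 2/lr from the linearization above.

```python
# Minimal sketch of the suggested experiment (assumptions: toy Gaussian
# data, a small tanh MLP, unnormalized square loss 0.5*||f(X) - y||^2,
# and a hand-picked learning rate; none of these come from the comment).
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

kx, ky, kp = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (64, 10))   # stand-in for "your own data"
y = jax.random.normal(ky, (64,))

def init_params(key, sizes=(10, 64, 64, 1)):
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)

# Work with a flat parameter vector so the Jacobian is a single matrix.
flat, unravel = ravel_pytree(init_params(kp))

def loss(p):
    r = forward(unravel(p), X) - y
    return 0.5 * jnp.sum(r ** 2)

def ntk_top_eig(p):
    # J has shape (n_examples, n_params); J @ J.T is the empirical NTK.
    J = jax.jacobian(lambda q: forward(unravel(q), X))(p)
    return jnp.linalg.eigvalsh(J @ J.T)[-1]   # spectral norm of JJ^T

lr = 3e-3
grad_fn = jax.jit(jax.grad(loss))
for step in range(3001):
    flat = flat - lr * grad_fn(flat)
    if step % 250 == 0:
        # The claim to test: lambda_max climbs to ~2/lr, then plateaus.
        print(f"step {step:4d}  loss {float(loss(flat)):9.4f}  "
              f"lambda_max(JJ^T) {float(ntk_top_eig(flat)):9.2f}  "
              f"2/lr = {2 / lr:.1f}")
```

Swapping in a deeper net, real data, or a different learning rate only moves the threshold 2/lr; whether lambda_max still rises to that value and plateaus is exactly the question the comment invites you to settle yourself.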
