Discussion about this post

If ADAM is a bad idea, does that mean Newton's method is a bad idea, too? Assuming I understand correctly, ADAM is the first-order shadow of Newton's method arising from a diagonal approximation to the Hessian. I've always viewed Newton's method as the gold standard and SGD as the necessary compromise because Hessians are too big for models with many, many parameters.
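
The analogy in this comment can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: the quadratic objective, the matrix `A`, the starting point, and the function names are all hypothetical and not from the post. It puts one diagonal-Newton step next to one standard Adam step, so the two preconditioners can be compared: diagonal Newton divides by the Hessian diagonal, while Adam divides by the square root of a running average of squared gradients.

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5 * x^T A x with a diagonal A,
# so the Hessian is A and its diagonal approximation is exact.
A = np.diag([1.0, 100.0])

def grad(x):
    return A @ x

def diagonal_newton_step(x):
    # Divide each gradient coordinate by the matching Hessian diagonal entry.
    return x - grad(x) / np.diag(A)

def adam_step(x, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update: divide by the root of a running average of g**2.
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

x0 = np.array([1.0, 1.0])
x_newton = diagonal_newton_step(x0)
x_adam, m, v = adam_step(x0, np.zeros(2), np.zeros(2), t=1)
print("diagonal Newton:", x_newton)
print("Adam:", x_adam)
```

On this example the diagonal-Newton step lands on the minimizer at once, whereas Adam's first step moves each coordinate by roughly `lr` regardless of curvature. Whether the two behave alike on a real model depends on how closely the squared-gradient average tracks the Hessian diagonal, which this toy problem does not settle.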
