If ADAM is a bad idea, does that mean Newton's method is a bad idea, too? Assuming I understand correctly, ADAM is the first-order shadow of Newton's method arising from a diagonal approximation to the Hessian. I've always viewed Newton's method as the gold standard and SGD as the necessary compromise because Hessians are too big for models with many, many parameters.
ADAM is neither a Newton method nor a quasi-Newton method. It's an entirely different beast. You can do stochastic Newton and quasi-Newton methods, and they definitely work well in many machine learning settings. But the adaptive scaling ADAM uses is its own (not particularly well motivated) thing.
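To make that distinction concrete, here's a minimal numpy sketch (my own illustration, not anything from the post): ADAM rescales the step by a running average of *squared gradients*, while a diagonal Newton step rescales by the actual *diagonal of the Hessian*. The toy quadratic objective and all names below are illustrative assumptions.

```python
import numpy as np

# Toy objective: f(x) = 0.5 * x^T diag(h) x, so grad = h * x and the
# Hessian is exactly diag(h).
h = np.array([100.0, 1.0])          # badly scaled curvature
x_adam = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])

m = np.zeros(2)                     # ADAM first-moment estimate
v = np.zeros(2)                     # ADAM second-moment estimate (squared grads)
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 101):
    # ADAM: precondition with the sqrt of an EMA of g^2 -- a statistic of
    # the gradient, not of the curvature.
    g = h * x_adam
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x_adam = x_adam - lr * m_hat / (np.sqrt(v_hat) + eps)

    # Diagonal Newton: precondition with the actual diagonal of the Hessian.
    g = h * x_newton
    x_newton = x_newton - g / h

print("ADAM after 100 steps:           ", x_adam)
print("Diagonal Newton after 100 steps:", x_newton)
```

On this deliberately easy problem the diagonal Newton step lands at the minimizer immediately (the Hessian really is diagonal), while ADAM's square-root-of-squared-gradients scaling is doing something else entirely, which is the point of the reply above.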
Why minimize *average* prediction error though? Surely ML still cares about strictly proper scores (à la Gneiting & Raftery https://doi.org/10.1198/016214506000001437). (Maybe I'm reading this too literally and it just means minimizing some strictly proper score in the probabilistic case, right?)
The premise is the same: we're always minimizing something; it's just not always obvious what the 'right' choice is.
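One way the two readings line up (again, my own sketch, with an assumed Bernoulli setup): the log score is strictly proper, so minimizing the *average* log loss over data is exactly the "minimize some strictly proper score" case mentioned above. In expectation it's minimized by the true probability.

```python
import numpy as np

def expected_log_loss(q, p_true):
    """Expected negative log score of forecast q when outcomes ~ Bernoulli(p_true)."""
    return -(p_true * np.log(q) + (1 - p_true) * np.log(1 - q))

p_true = 0.3
qs = np.linspace(0.01, 0.99, 981)
best_q = qs[np.argmin(expected_log_loss(qs, p_true))]
print(best_q)  # ~0.30: strict propriety -- the average score is minimized at the truth
```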
"And though it’s sad for me to admit, there’s a lot of mileage you can get out of having a single optimization package." I wonder where the sadness comes from.