Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks

We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for the non-convex optimization problems that arise in training deep neural networks with enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices together with either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic); in the latter case we prove convergence to a stationary point. Using the Sherman-Morrison-Woodbury formula together with automatic differentiation (backpropagation), we show how our methods can be implemented efficiently. Finally, numerical results demonstrate the effectiveness of the proposed methods.
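
The efficiency claim rests on the Sherman-Morrison-Woodbury (SMW) identity: when the subsampled Gauss-Newton (or Fisher) matrix is a low-rank term plus a Levenberg-Marquardt damping multiple of the identity, the n-dimensional linear system defining the step reduces to an m-dimensional one, where m is the subsample size and m << n. The NumPy sketch below illustrates this reduction for a generic damped least-squares step; the dimensions, the damping parameter lam, and the variable names are illustrative assumptions, not the authors' implementation.

import numpy as np

# Minimal sketch, assuming a generic damped Gauss-Newton step (not the paper's code).
# Sizes are illustrative: n parameters, m subsampled residuals, with m << n.
n, m, lam = 10_000, 50, 1e-2
rng = np.random.default_rng(0)
J = rng.standard_normal((m, n))   # subsampled Jacobian (one row per sampled example)
g = rng.standard_normal(n)        # (possibly subsampled) gradient estimate

# Levenberg-Marquardt step: solve (lam*I_n + (1/m) J^T J) d = -g.
# Forming the n x n matrix is infeasible for large n.  Sherman-Morrison-Woodbury gives
#   (lam*I_n + (1/m) J^T J)^{-1}
#     = (1/lam) I_n - (1/lam^2) J^T (m*I_m + (1/lam) J J^T)^{-1} J,
# so only an m x m linear system has to be solved.
small = m * np.eye(m) + (J @ J.T) / lam          # m x m system matrix
d = -(g / lam - (J.T @ np.linalg.solve(small, J @ g)) / lam**2)

# For small n, the step can be checked against the direct n x n solve:
# d_direct = -np.linalg.solve(lam * np.eye(n) + J.T @ J / m, g)

In an actual implementation, the products with J and J^T would be obtained from automatic differentiation (backpropagation) rather than from an explicitly stored random matrix as above.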
