Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks

We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for the non-convex optimization problems that arise in training deep neural networks with enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices together with either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic); in the latter case we prove convergence to a stationary point. Using the Sherman-Morrison-Woodbury formula together with automatic differentiation (backpropagation), we show how our methods can be implemented efficiently. Finally, numerical results demonstrate the effectiveness of the proposed methods.
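
The efficiency claim rests on the Sherman-Morrison-Woodbury (SMW) identity: when the subsampled Gauss-Newton (or Fisher) matrix is a low-rank term plus a Levenberg-Marquardt damping multiple of the identity, the n-dimensional linear system defining the step reduces to an m-dimensional one, where m is the subsample size and m << n. The NumPy sketch below illustrates this reduction for a generic damped least-squares step; the dimensions, the damping parameter lam, and the variable names are illustrative assumptions, not the authors' implementation.

import numpy as np

# Minimal sketch, assuming a generic damped Gauss-Newton step (not the paper's code).
# Sizes are illustrative: n parameters, m subsampled residuals, with m << n.
n, m, lam = 10_000, 50, 1e-2
rng = np.random.default_rng(0)
J = rng.standard_normal((m, n))   # subsampled Jacobian (one row per sampled example)
g = rng.standard_normal(n)        # (possibly subsampled) gradient estimate

# Levenberg-Marquardt step: solve (lam*I_n + (1/m) J^T J) d = -g.
# Forming the n x n matrix is infeasible for large n.  Sherman-Morrison-Woodbury gives
#   (lam*I_n + (1/m) J^T J)^{-1}
#     = (1/lam) I_n - (1/lam^2) J^T (m*I_m + (1/lam) J J^T)^{-1} J,
# so only an m x m linear system has to be solved.
small = m * np.eye(m) + (J @ J.T) / lam          # m x m system matrix
d = -(g / lam - (J.T @ np.linalg.solve(small, J @ g)) / lam**2)

# For small n, the step can be checked against the direct n x n solve:
# d_direct = -np.linalg.solve(lam * np.eye(n) + J.T @ J / m, g)

In an actual implementation, the products with J and J^T would be obtained from automatic differentiation (backpropagation) rather than from an explicitly stored random matrix as above.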
