New perspectives on the natural gradient method

In this report we review and discuss some theoretical aspects of Amari's natural gradient method, provide a unifying picture of the many different versions of it which have appeared over the years, and offer some new insights and perspectives regarding the method and its relationship to other optimization methods. Among our various contributions is the identification of a general condition under which the Fisher information matrix and Schraudolph's generalized Gauss-Newton matrix are equivalent. This equivalence implies that optimization methods which use the latter matrix, such as the Hessian-free optimization approach of Martens, are actually natural gradient methods in disguise. It also lets us view natural gradient methods as approximate Newton methods, justifying the application of various "update damping" techniques to them, which are designed to compensate for breakdowns in local quadratic approximations. Additionally, we analyze the parameterization invariance possessed by the natural gradient method in the idealized setting of infinitesimally small update steps, and consider the extent to which it holds for practical versions of the method which take large discrete steps. We go on to show that parameterization invariance is not possessed by the classical Newton-Raphson method (even in the idealized setting), and then give a general characterization of gradient-based methods which do possess it.
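As a point of reference for the objects named above, the following is a minimal sketch of the standard definitions involved; the notation (an objective $h(\theta)$, a predictive distribution $p(y \mid x, \theta)$ with input distribution $q(x)$, and a network output function $f(x, \theta)$) is introduced here for illustration and is not fixed by the report itself. The natural gradient preconditions the ordinary gradient with the inverse of the Fisher information matrix,

\[
F(\theta) = \mathbb{E}_{x \sim q,\; y \sim p(y \mid x, \theta)}\!\left[ \nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^{\top} \right],
\qquad
\theta_{k+1} = \theta_k - \alpha_k\, F(\theta_k)^{-1} \nabla_\theta h(\theta_k),
\]

while Schraudolph's generalized Gauss-Newton matrix takes the form

\[
G(\theta) = \mathbb{E}_{x \sim q}\!\left[ J_f(x, \theta)^{\top}\, H_L\!\big(f(x, \theta)\big)\, J_f(x, \theta) \right],
\]

where $J_f$ is the Jacobian of the network output with respect to $\theta$ and $H_L$ is the Hessian of the loss with respect to that output. Roughly speaking, the equivalence condition referred to above holds when the loss is the negative log-likelihood of an exponential-family predictive distribution whose natural parameters are given by the network's outputs; in that case $F(\theta) = G(\theta)$, which is what allows methods built on $G$, such as Hessian-free optimization, to be read as natural gradient methods.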

[1] Nicol N. Schraudolph et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent, 2002, Neural Computation.

[2] Sham M. Kakade et al. Competing with the Empirical Risk Minimizer in a Single Pass, 2014, COLT.

[3] Ilya Sutskever et al. Estimating the Hessian by Back-propagating Curvature, 2012, ICML.

[4] Richard H. Bartels et al. Algorithm 432 [C2]: Solution of the matrix equation AX + XB = C [F4], 1972, Commun. ACM.

[5] Kenji Fukumizu et al. Adaptive natural gradient learning algorithms for various stochastic models, 2000, Neural Networks.

[6] O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[7] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[8] James Martens et al. New Insights and Perspectives on the Natural Gradient Method, 2014, J. Mach. Learn. Res.

[9] Yann LeCun et al. Improving the convergence of back-propagation learning with second-order methods, 1989.

[10] James Martens et al. Deep learning via Hessian-free optimization, 2010, ICML.

[11] Tom Heskes et al. On Natural Learning and Pruning in Multilayered Perceptrons, 2000, Neural Computation.

[12] Razvan Pascanu et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.

[13] Noboru Murata et al. A Statistical Study on On-line Learning, 1999.

[14] Te-son Kuo et al. Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation, 1986.

[15] John Moody et al. Learning rate schedules for faster stochastic gradient search, 1992, Neural Networks for Signal Processing II, Proceedings of the 1992 IEEE Workshop.

[16] Yoram Singer et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[17] Ilya Sutskever et al. Learning Recurrent Neural Networks with Hessian-Free Optimization, 2011, ICML.

[18] Shun-ichi Amari et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[19] N. Komaroff. Upper summation and product bounds for solution eigenvalues of the Lyapunov matrix equation, 1992.

[20] Patrick Gallinari et al. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent, 2009, J. Mach. Learn. Res.

[21] N. Komaroff. Simultaneous eigenvalue lower bounds for the Lyapunov matrix equation, 1988.

[22] D. K. Smith et al. Numerical Optimization, 2001, J. Oper. Res. Soc.

[23] Francis R. Bach et al. From Averaging to Acceleration, There is Only a Step-size, 2015, COLT.

[24] Grégoire Montavon et al. Neural Networks: Tricks of the Trade, 2012, Lecture Notes in Computer Science.

[25] Nicolas Le Roux et al. Topmoumoute Online Natural Gradient Algorithm, 2007, NIPS.

[26] Ruslan Salakhutdinov et al. Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix, 2015, ICML.

[27] Ilya Sutskever et al. Training Deep and Recurrent Networks with Hessian-Free Optimization, 2012, Neural Networks: Tricks of the Trade.

[28] Stefan Schaal et al. Natural Actor-Critic, 2003, Neurocomputing.

[29] Yann Ollivier et al. Riemannian metrics for neural networks I: feedforward networks, 2013, arXiv:1303.0818.

[30] Razvan Pascanu et al. Natural Neural Networks, 2015, NIPS.

[31] Francis R. Bach et al. Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions, 2014, AISTATS.

[32] Roger B. Grosse et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[33] Young Soo Moon et al. Bounds in algebraic Riccati and Lyapunov equations: a survey and some new results, 1996.

[34] Mark W. Schmidt et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[35] Jorge J. Moré et al. The Levenberg-Marquardt algorithm: Implementation and theory, 1977.

[36] Tom Schaul et al. No more pesky learning rates, 2012, ICML.

[37] Shun-ichi Amari et al. Adaptive blind signal processing-neural network approaches, 1998, Proc. IEEE.

[38] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[39] Andrew W. Fitzgibbon et al. A fast natural Newton method, 2010, ICML.

[40] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.

[41] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[42] Daniel Povey et al. Krylov Subspace Descent for Deep Learning, 2011, AISTATS.

[43] Léon Bottou et al. On-line learning for very large data sets, 2005.