New perspectives on the natural gradient method

In this report we review and discuss some theoretical aspects of Amari's natural gradient method, provide a unifying picture of the many different versions of it which have appeared over the years, and offer some new insights and perspectives regarding the method and its relationship to other optimization methods. Among our various contributions is the identification of a general condition under which the Fisher information matrix and Schraudolph's generalized Gauss-Newton matrix are equivalent. This equivalence implies that optimization methods which use the latter matrix, such as the Hessian-free optimization approach of Martens, are actually natural gradient methods in disguise. It also lets us view natural gradient methods as approximate Newton methods, justifying the application of various "update damping" techniques to them, which are designed to compensate for breakdowns in local quadratic approximations. Additionally, we analyze the parameterization invariance possessed by the natural gradient method in the idealized setting of infinitesimally small update steps, and consider the extent to which it holds for practical versions of the method which take large discrete steps. We go on to show that parameterization invariance is not possessed by the classical Newton-Raphson method (even in the idealized setting), and then give a general characterization of gradient-based methods which do possess it.
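As a point of reference for the objects named above, the following is a minimal sketch of the standard definitions involved; the notation (an objective $h(\theta)$, a predictive distribution $p(y \mid x, \theta)$ with input distribution $q(x)$, and a network output function $f(x, \theta)$) is introduced here for illustration and is not fixed by the report itself. The natural gradient preconditions the ordinary gradient with the inverse of the Fisher information matrix,

\[
F(\theta) = \mathbb{E}_{x \sim q,\; y \sim p(y \mid x, \theta)}\!\left[ \nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^{\top} \right],
\qquad
\theta_{k+1} = \theta_k - \alpha_k\, F(\theta_k)^{-1} \nabla_\theta h(\theta_k),
\]

while Schraudolph's generalized Gauss-Newton matrix takes the form

\[
G(\theta) = \mathbb{E}_{x \sim q}\!\left[ J_f(x, \theta)^{\top}\, H_L\!\big(f(x, \theta)\big)\, J_f(x, \theta) \right],
\]

where $J_f$ is the Jacobian of the network output with respect to $\theta$ and $H_L$ is the Hessian of the loss with respect to that output. Roughly speaking, the equivalence condition referred to above holds when the loss is the negative log-likelihood of an exponential-family predictive distribution whose natural parameters are given by the network's outputs; in that case $F(\theta) = G(\theta)$, which is what allows methods built on $G$, such as Hessian-free optimization, to be read as natural gradient methods.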

[1] Nicol N. Schraudolph et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent, 2002, Neural Computation.

[2] Sham M. Kakade et al. Competing with the Empirical Risk Minimizer in a Single Pass, 2014, COLT.

[3] Ilya Sutskever et al. Estimating the Hessian by Back-propagating Curvature, 2012, ICML.

[4] Richard H. Bartels et al. Algorithm 432 [C2]: Solution of the matrix equation AX + XB = C [F4], 1972, Commun. ACM.

[5] Kenji Fukumizu et al. Adaptive natural gradient learning algorithms for various stochastic models, 2000, Neural Networks.

[6] O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[7] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[8] James Martens et al. New Insights and Perspectives on the Natural Gradient Method, 2014, J. Mach. Learn. Res.

[9] Yann LeCun et al. Improving the convergence of back-propagation learning with second-order methods, 1989.

[10] James Martens et al. Deep learning via Hessian-free optimization, 2010, ICML.

[11] Tom Heskes et al. On Natural Learning and Pruning in Multilayered Perceptrons, 2000, Neural Computation.

[12] Razvan Pascanu et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.

[13] Noboru Murata et al. A Statistical Study on On-line Learning, 1999.

[14] Te-son Kuo et al. Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation, 1986.

[15] John Moody et al. Learning rate schedules for faster stochastic gradient search, 1992, Neural Networks for Signal Processing II, Proceedings of the 1992 IEEE Workshop.

[16] Yoram Singer et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[17] Ilya Sutskever et al. Learning Recurrent Neural Networks with Hessian-Free Optimization, 2011, ICML.

[18] Shun-ichi Amari et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[19] N. Komaroff. Upper summation and product bounds for solution eigenvalues of the Lyapunov matrix equation, 1992.

[20] Patrick Gallinari et al. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent, 2009, J. Mach. Learn. Res.

[21] N. Komaroff. Simultaneous eigenvalue lower bounds for the Lyapunov matrix equation, 1988.

[22] D. K. Smith et al. Numerical Optimization, 2001, J. Oper. Res. Soc.

[23] Francis R. Bach et al. From Averaging to Acceleration, There is Only a Step-size, 2015, COLT.

[24] Grégoire Montavon et al. Neural Networks: Tricks of the Trade, 2012, Lecture Notes in Computer Science.

[25] Nicolas Le Roux et al. Topmoumoute Online Natural Gradient Algorithm, 2007, NIPS.

[26] Ruslan Salakhutdinov et al. Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix, 2015, ICML.

[27] Ilya Sutskever et al. Training Deep and Recurrent Networks with Hessian-Free Optimization, 2012, Neural Networks: Tricks of the Trade.

[28] Stefan Schaal et al. Natural Actor-Critic, 2003, Neurocomputing.

[29] Yann Ollivier et al. Riemannian metrics for neural networks I: feedforward networks, 2013, arXiv:1303.0818.

[30] Razvan Pascanu et al. Natural Neural Networks, 2015, NIPS.

[31] Francis R. Bach et al. Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions, 2014, AISTATS.

[32] Roger B. Grosse et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[33] Young Soo Moon et al. Bounds in algebraic Riccati and Lyapunov equations: a survey and some new results, 1996.

[34] Mark W. Schmidt et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[35] Jorge J. Moré et al. The Levenberg-Marquardt algorithm: Implementation and theory, 1977.

[36] Tom Schaul et al. No more pesky learning rates, 2012, ICML.

[37] Shun-ichi Amari et al. Adaptive blind signal processing-neural network approaches, 1998, Proc. IEEE.

[38] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[39] Andrew W. Fitzgibbon et al. A fast natural Newton method, 2010, ICML.

[40] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.

[41] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[42] Daniel Povey et al. Krylov Subspace Descent for Deep Learning, 2011, AISTATS.

[43] Léon Bottou et al. On-line learning for very large data sets, 2005.