Optimizing Neural Networks with Kronecker-factored Approximate Curvature

We propose an efficient method for approximating natural gradient descent in neural networks, which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating large blocks of the Fisher (corresponding to entire layers) as the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress on the objective, resulting in an algorithm that can be much faster in practice than stochastic gradient descent with momentum. Unlike some previously proposed approximate natural-gradient/Newton methods that use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
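
The efficiency of the Kronecker-factored approximation comes from the identity (A ⊗ G)^(-1) = A^(-1) ⊗ G^(-1): a layer's Fisher block can be inverted by inverting its two small factors rather than the full block. The NumPy sketch below illustrates this for a single fully connected layer. It is a minimal illustration under stated assumptions, not the paper's implementation: the variable names (a, g, A, G, damping), the random data, and the simple per-factor Tikhonov damping are all assumptions made for the sketch.

```python
import numpy as np

# Minimal sketch of the Kronecker-factored Fisher approximation for one
# fully connected layer (illustrative only).
# The layer's Fisher block is approximated as kron(A, G), where
#   A = E[a a^T]  (second moment of the layer's input activations)
#   G = E[g g^T]  (second moment of the back-propagated pre-activation gradients)

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 50, 30

a = rng.standard_normal((batch, d_in))    # layer inputs (assumed random here)
g = rng.standard_normal((batch, d_out))   # back-propagated gradients (assumed random)
grad_W = g.T @ a / batch                  # gradient w.r.t. the weights, shape (d_out, d_in)

A = a.T @ a / batch                       # d_in  x d_in  Kronecker factor
G = g.T @ g / batch                       # d_out x d_out Kronecker factor

damping = 1e-2                            # simple per-factor damping (simplified vs. the paper)
A_damped = A + damping * np.eye(d_in)
G_damped = G + damping * np.eye(d_out)

# Because (A kron G)^{-1} = A^{-1} kron G^{-1}, applying the inverse of the
# factored block to the gradient only requires the two small inverses:
#   update = G^{-1} @ grad_W @ A^{-1}
update = np.linalg.solve(G_damped, grad_W) @ np.linalg.inv(A_damped)

# Sanity check against the explicit (d_in*d_out)^2 block: with column-major vec,
# kron(A, G) @ vec(X) equals vec(G @ X @ A.T).
F_block = np.kron(A_damped, G_damped)
update_explicit = np.linalg.solve(F_block, grad_W.flatten(order="F")).reshape(
    (d_out, d_in), order="F")
assert np.allclose(update, update_explicit, atol=1e-6)
```

In the full algorithm the factor statistics are maintained as decayed running averages over mini-batches and the damping is adapted during training; the sketch only shows why inverting the factored block is cheap compared with forming and inverting the full Fisher block.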
