Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition

Several studies have shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient-descent-based methods. However, the bottleneck of this approach for training deep neural networks is the prohibitive cost of solving, at each iteration, a large dense linear system involving the Fisher Information Matrix (FIM). This has motivated various approximations of either the exact FIM or the empirical one. The most sophisticated of these is KFAC, which relies on a Kronecker-factored block-diagonal approximation of the FIM. At only a slight additional cost, four improvements to KFAC are proposed from the standpoint of accuracy. Their common feature is that each is defined by a direct minimization problem whose solution can be computed via the Kronecker product singular value decomposition (KPSVD) technique. Experimental results on the three standard deep auto-encoder benchmarks show that the new methods provide more accurate approximations to the FIM and outperform both KFAC and state-of-the-art first-order methods in terms of optimization speed.
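The abstract does not detail the construction, but the Kronecker product SVD it refers to is classically computed via the Van Loan-Pitsianis rearrangement: the best Frobenius-norm approximation of a matrix A by a single Kronecker product B kron C reduces to a rank-one approximation of a rearranged matrix R(A), obtained from its leading singular triplet. The NumPy sketch below illustrates that reduction only; the function name, argument shapes, and the dense representation of the block being approximated are illustrative assumptions, not the paper's implementation.

import numpy as np

def nearest_kronecker_product(A, shape_B, shape_C):
    """Best Frobenius-norm approximation A ~ B kron C (Van Loan-Pitsianis).

    A has shape (m1*m2, n1*n2); the returned B is (m1, n1) and C is (m2, n2).
    """
    m1, n1 = shape_B
    m2, n2 = shape_C
    assert A.shape == (m1 * m2, n1 * n2)
    # Rearrange A so that ||A - B kron C||_F = ||R - vec(B) vec(C)^T||_F,
    # with vec taken row-major to match NumPy's reshape convention.
    R = (A.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    # The optimal pair (B, C) comes from the leading singular triplet of R.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    B = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    C = np.sqrt(s[0]) * Vt[0, :].reshape(m2, n2)
    return B, C

# Example: recover the factors of an exact Kronecker product (up to sign).
B0, C0 = np.random.randn(3, 4), np.random.randn(5, 6)
B, C = nearest_kronecker_product(np.kron(B0, C0), (3, 4), (5, 6))
assert np.allclose(np.kron(B, C), np.kron(B0, C0))

In the block-diagonal FIM setting, A would play the role of one layer's Fisher block and (B, C) the Kronecker factors being fitted directly; the four methods proposed in the paper build on this kind of minimization, and their exact objectives are more elaborate than this generic reduction.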
