Fast Evaluation and Approximation of the Gauss-Newton Hessian Matrix for the Multilayer Perceptron

We introduce a fast algorithm for entry-wise evaluation of the Gauss-Newton Hessian (GNH) matrix of the multilayer perceptron. The algorithm consists of a precomputation step and a sampling step. While it generally requires $O(Nn)$ work to compute a single entry (and, at the same cost, an entire column) of the GNH matrix for a neural network with $N$ parameters and $n$ data points, our fast sampling algorithm reduces this cost to $O(n + d/\epsilon^2)$ work, where $d$ is the output dimension of the network and $\epsilon$ is a prescribed accuracy; in particular, the cost is independent of $N$. One application of our algorithm is constructing a hierarchical-matrix ($\mathcal{H}$-matrix) approximation of the GNH matrix for solving linear systems and eigenvalue problems. While storing and factorizing the GNH matrix generally requires $O(N^2)$ memory and $O(N^3)$ work, respectively, the $\mathcal{H}$-matrix approximation requires only $O(N r_o)$ memory and $O(N r_o^2)$ work to factorize, where $r_o \ll N$ is the maximum rank of the off-diagonal blocks of the GNH matrix. We demonstrate the performance of our fast algorithm and the $\mathcal{H}$-matrix approximation on classification and autoencoder neural networks.
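To make the complexity claims concrete, below is a minimal JAX sketch of naive entry-wise GNH evaluation, assuming a least-squares loss so that the GNH reduces to $G = \frac{1}{n}\sum_{k=1}^{n} J_k^\top J_k$, where $J_k$ is the $d \times N$ Jacobian of the network output with respect to the parameters at data point $k$. The network and the function names (`mlp`, `gnh_entry`, `gnh_entry_sampled`) are illustrative assumptions, and the plain Monte Carlo estimator is not the paper's precomputation-plus-sampling scheme; it only illustrates the $O(1/\epsilon^2)$ sample-size scaling, since each sampled term here still costs $O(N)$ rather than $O(d)$.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def mlp(params, x):
    """A small multilayer perceptron with tanh hidden activations."""
    for W, b in params[:-1]:
        x = jnp.tanh(W @ x + b)
    W, b = params[-1]
    return W @ x + b  # network output of dimension d

def gnh_entry(params, X, i, j):
    """Naive O(N*n) evaluation of entry (i, j) of the GNH matrix
    G = (1/n) * sum_k J_k^T J_k (least-squares loss assumed)."""
    theta0, unravel = ravel_pytree(params)  # flatten the N parameters
    f = lambda theta, x: mlp(unravel(theta), x)
    g = 0.0
    for x in X:                              # n data points
        J = jax.jacobian(f)(theta0, x)       # d x N Jacobian at one point
        g += J[:, i] @ J[:, j]
    return g / X.shape[0]

def gnh_entry_sampled(params, X, i, j, s, key):
    """Monte Carlo estimate of entry (i, j) from s sampled data points;
    s = O(1/eps^2) gives accuracy eps. NOTE: this generic estimator only
    illustrates the sample-size scaling -- without the paper's
    precomputation step, each sampled term still costs O(N), not O(d)."""
    theta0, unravel = ravel_pytree(params)
    f = lambda theta, x: mlp(unravel(theta), x)
    idx = jax.random.choice(key, X.shape[0], (s,))  # sample with replacement
    g = 0.0
    for k in idx:
        J = jax.jacobian(f)(theta0, X[k])
        g += J[:, i] @ J[:, j]
    return g / s

# Hypothetical usage: a 4-16-2 MLP (N = 114 parameters) on n = 8 points.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = [(jax.random.normal(k1, (16, 4)), jnp.zeros(16)),
          (jax.random.normal(k2, (2, 16)), jnp.zeros(2))]
X = jax.random.normal(k3, (8, 4))
print(gnh_entry(params, X, 0, 1))
print(gnh_entry_sampled(params, X, 0, 1, s=4, key=k4))
```

Note that each Jacobian evaluation above costs $O(N)$ work per data point, which is exactly why the naive entry evaluation is $O(Nn)$; the paper's precomputation step is what removes this per-sample dependence on $N$.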
