Fast Evaluation and Approximation of the Gauss-Newton Hessian Matrix for the Multilayer Perceptron

We introduce a fast algorithm for entry-wise evaluation of the Gauss-Newton Hessian (GNH) matrix of the multilayer perceptron. The algorithm consists of a precomputation step and a sampling step. While it generally requires $O(Nn)$ work to compute a single entry (and, at the same cost, an entire column) of the GNH matrix for a neural network with $N$ parameters and $n$ data points, our fast sampling algorithm reduces this cost to $O(n + d/\epsilon^2)$ work, where $d$ is the output dimension of the network and $\epsilon$ is a prescribed accuracy; in particular, the cost is independent of $N$. One application of our algorithm is constructing a hierarchical-matrix ($\mathcal{H}$-matrix) approximation of the GNH matrix for solving linear systems and eigenvalue problems. While storing and factorizing the GNH matrix generally requires $O(N^2)$ memory and $O(N^3)$ work, respectively, the $\mathcal{H}$-matrix approximation requires only $O(N r_o)$ memory and $O(N r_o^2)$ work to factorize, where $r_o \ll N$ is the maximum rank of the off-diagonal blocks of the GNH matrix. We demonstrate the performance of our fast algorithm and the $\mathcal{H}$-matrix approximation on classification and autoencoder neural networks.
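To make the complexity claims concrete, below is a minimal JAX sketch of naive entry-wise GNH evaluation, assuming a least-squares loss so that the GNH reduces to $G = \frac{1}{n}\sum_{k=1}^{n} J_k^\top J_k$, where $J_k$ is the $d \times N$ Jacobian of the network output with respect to the parameters at data point $k$. The network and the function names (`mlp`, `gnh_entry`, `gnh_entry_sampled`) are illustrative assumptions, and the plain Monte Carlo estimator is not the paper's precomputation-plus-sampling scheme; it only illustrates the $O(1/\epsilon^2)$ sample-size scaling, since each sampled term here still costs $O(N)$ rather than $O(d)$.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def mlp(params, x):
    """A small multilayer perceptron with tanh hidden activations."""
    for W, b in params[:-1]:
        x = jnp.tanh(W @ x + b)
    W, b = params[-1]
    return W @ x + b  # network output of dimension d

def gnh_entry(params, X, i, j):
    """Naive O(N*n) evaluation of entry (i, j) of the GNH matrix
    G = (1/n) * sum_k J_k^T J_k (least-squares loss assumed)."""
    theta0, unravel = ravel_pytree(params)  # flatten the N parameters
    f = lambda theta, x: mlp(unravel(theta), x)
    g = 0.0
    for x in X:                              # n data points
        J = jax.jacobian(f)(theta0, x)       # d x N Jacobian at one point
        g += J[:, i] @ J[:, j]
    return g / X.shape[0]

def gnh_entry_sampled(params, X, i, j, s, key):
    """Monte Carlo estimate of entry (i, j) from s sampled data points;
    s = O(1/eps^2) gives accuracy eps. NOTE: this generic estimator only
    illustrates the sample-size scaling -- without the paper's
    precomputation step, each sampled term still costs O(N), not O(d)."""
    theta0, unravel = ravel_pytree(params)
    f = lambda theta, x: mlp(unravel(theta), x)
    idx = jax.random.choice(key, X.shape[0], (s,))  # sample with replacement
    g = 0.0
    for k in idx:
        J = jax.jacobian(f)(theta0, X[k])
        g += J[:, i] @ J[:, j]
    return g / s

# Hypothetical usage: a 4-16-2 MLP (N = 114 parameters) on n = 8 points.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = [(jax.random.normal(k1, (16, 4)), jnp.zeros(16)),
          (jax.random.normal(k2, (2, 16)), jnp.zeros(2))]
X = jax.random.normal(k3, (8, 4))
print(gnh_entry(params, X, 0, 1))
print(gnh_entry_sampled(params, X, 0, 1, s=4, key=k4))
```

Note that each Jacobian evaluation above costs $O(N)$ work per data point, which is exactly why the naive entry evaluation is $O(Nn)$; the paper's precomputation step is what removes this per-sample dependence on $N$.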
