WoodFisher: Efficient Second-Order Approximations for Model Compression

Second-order information, in the form of Hessian- or inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been a tremendous amount of work on utilizing this information for today's compute- and memory-intensive deep neural networks, usually via coarse-grained approximations (such as diagonal, blockwise, or Kronecker-factored ones). However, not much is known about the quality of these approximations. Our work addresses this question, and in particular, we propose a method called 'WoodFisher' that leverages the structure of the empirical Fisher information matrix, together with the Woodbury matrix identity, to compute a faithful and efficient estimate of the inverse Hessian. Our main application is the task of compressing neural networks, where we build on the classical Optimal Brain Damage/Surgeon framework (LeCun et al., 1990; Hassibi and Stork, 1993). We demonstrate that WoodFisher significantly outperforms magnitude pruning (which corresponds to assuming an isotropic Hessian), as well as methods that maintain other diagonal estimates. Further, even when gradual pruning is considered, our method yields a gain in test accuracy over state-of-the-art approaches on standard image classification datasets such as CIFAR-10 and ImageNet. We also propose a variant called 'WoodTaylor', which takes into account the first-order gradient term and can lead to additional improvements. An important advantage of our methods is that they allow us to set the layer-wise pruning thresholds automatically, avoiding the need for manual tuning or sensitivity analysis.
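To make the mechanics concrete, below is a minimal NumPy sketch of the two ingredients mentioned above: the Woodbury (Sherman-Morrison) recursion that builds the inverse of a dampened empirical Fisher from per-example gradients, and the Optimal Brain Surgeon pruning statistic and weight correction computed from that inverse. This is an illustrative sketch, not the authors' reference implementation; the function names, the dampening constant damp, and the toy usage at the end are assumptions made here for clarity.

    import numpy as np

    def woodfisher_inverse(grads, damp=1e-5):
        # grads: (N, d) array of per-example gradients.
        # Returns an estimate of (damp * I + (1/N) * sum_i g_i g_i^T)^{-1},
        # built one rank-one Sherman-Morrison/Woodbury update at a time.
        n, d = grads.shape
        f_inv = np.eye(d) / damp                      # inverse of the damping term
        for g in grads:
            fg = f_inv @ g
            f_inv -= np.outer(fg, fg) / (n + g @ fg)  # rank-one Woodbury update
        return f_inv

    def obs_prune_one(w, f_inv):
        # Optimal Brain Surgeon step: pick the weight q with the smallest
        # statistic w_q^2 / (2 [F^{-1}]_{qq}) and apply the compensating
        # update delta_w = -w_q * F^{-1} e_q / [F^{-1}]_{qq}.
        diag = np.diag(f_inv)
        q = int(np.argmin(w ** 2 / (2.0 * diag)))
        w_new = w - w[q] * f_inv[:, q] / diag[q]      # drives w_new[q] to zero
        return q, w_new

    # Toy usage: random per-example gradients for a 10-parameter model.
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(32, 10))
    w = rng.normal(size=10)
    q, w_pruned = obs_prune_one(w, woodfisher_inverse(grads))

For networks with millions of parameters the full d x d inverse is of course intractable; in practice such an estimate would be maintained over small blocks of weights rather than the whole parameter vector, which is the kind of block-wise structure the paper exploits.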

[1] James Martens, et al. Deep learning via Hessian-free optimization. ICML, 2010.

[2] Tom Heskes, et al. On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation, 2000.

[3] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR, 2018.

[4] Jimmy Ba, et al. Kronecker-factored Curvature Approximations for Recurrent Neural Networks. ICLR, 2018.

[5] Nicol N. Schraudolph, et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 2002.

[6] Xin Dong, et al. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. NIPS, 2017.

[7] Elman Mansimov, et al. Second-order Optimization for Deep Reinforcement Learning using Kronecker-factored Approximation. NIPS, 2017.

[8] Pascal Vincent, et al. An Evaluation of Fisher Approximations Beyond Kronecker Factorization. ICLR, 2018.

[9] Suyog Gupta, et al. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR, 2017.

[10] Martin Jaggi, et al. Model Fusion via Optimal Transport. NeurIPS, 2019.

[11] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[12] Max Welling, et al. Learning Sparse Neural Networks through L0 Regularization. ICLR, 2017.

[13] Yurong Chen, et al. Dynamic Network Surgery for Efficient DNNs. NIPS, 2016.

[14] Frederik Kunstner, et al. Limitations of the empirical Fisher approximation for natural gradient descent. NeurIPS, 2019.

[15] Guigang Zhang, et al. Deep Learning. Int. J. Semantic Comput., 2016.

[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[17] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[18] Roger B. Grosse, et al. A Kronecker-factored approximate Fisher matrix for convolution layers. ICML, 2016.

[19] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. NIPS, 1992.

[20] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.

[21] Luke Zettlemoyer, et al. Sparse Networks from Scratch: Faster Training without Losing Performance. arXiv, 2019.

[22] Lucas Theis, et al. Faster gaze prediction with dense networks and Fisher pruning. arXiv, 2018.

[23] Dmitry P. Vetrov, et al. Variational Dropout Sparsifies Deep Neural Networks. ICML, 2017.

[24] Yixin Chen, et al. Compressing Neural Networks with the Hashing Trick. ICML, 2015.

[25] Sanja Fidler, et al. EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis. ICML, 2019.

[26] Hanan Samet, et al. Pruning Filters for Efficient ConvNets. ICLR, 2016.

[27] Raquel Urtasun, et al. MLPrune: Multi-Layer Pruning for Automated Neural Network Compression. 2018.

[28] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.

[29] Yi-Ming Chan, et al. Unifying and Merging Well-trained Deep Neural Networks for Inference Stage. IJCAI, 2018.

[30] Yann LeCun, et al. Optimal Brain Damage. NIPS, 1989.

[31] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 2011.

[32] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. ICML, 2015.

[33] Michael C. Mozer, et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. NIPS, 1988.

[34] Martin Jaggi, et al. Dynamic Model Pruning with Feedback. ICLR, 2020.

[35] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training. 2018.

[36] Roger B. Grosse, et al. Distributed Second-Order Optimization using Kronecker-Factored Approximations. ICLR, 2016.

[37] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. ICLR, 2015.

[38] Percy Liang, et al. Understanding Black-box Predictions via Influence Functions. ICML, 2017.

[39] Tao Zhang, et al. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv, 2017.

[40] Yann Dauphin, et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. ICLR, 2017.

[41] Olatunji Ruwase, et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. SC, 2019.

[42] Erich Elsen, et al. The State of Sparsity in Deep Neural Networks. arXiv, 2019.

[43] Jianxin Wu, et al. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. ICCV, 2017.

[44] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.

[45] Jürgen Schmidhuber, et al. Deep learning in neural networks: An overview. Neural Networks, 2014.

[46] James Martens, et al. New Insights and Perspectives on the Natural Gradient Method. J. Mach. Learn. Res., 2014.

[47] Rif A. Saurous, et al. Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks. ICLR, 2017.

[48] Satoshi Matsuoka, et al. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. CVPR, 2019.

[49] Naman Agarwal, et al. Second-Order Stochastic Optimization for Machine Learning in Linear Time. J. Mach. Learn. Res., 2016.

[50] Miguel Á. Carreira-Perpiñán, et al. "Learning-Compression" Algorithms for Neural Net Pruning. CVPR, 2018.

[51] Jian Sun, et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[52] Xin Wang, et al. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. ICML, 2019.

[53] Yurii Nesterov, et al. Cubic regularization of Newton method and its global performance. Math. Program., 2006.