WoodFisher: Efficient Second-Order Approximations for Model Compression

Second-order information, in the form of Hessian- or inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been a tremendous amount of work on utilizing this information for today's compute- and memory-intensive deep neural networks, usually via coarse-grained approximations (such as diagonal, blockwise, or Kronecker-factored ones). However, not much is known about the quality of these approximations. Our work addresses this question, and in particular, we propose a method called 'WoodFisher' that leverages the structure of the empirical Fisher information matrix, together with the Woodbury matrix identity, to compute a faithful and efficient estimate of the inverse Hessian. Our main application is the task of compressing neural networks, where we build on the classical Optimal Brain Damage/Surgeon framework (LeCun et al., 1990; Hassibi and Stork, 1993). We demonstrate that WoodFisher significantly outperforms magnitude pruning (which corresponds to assuming an isotropic Hessian), as well as methods that maintain other diagonal estimates. Further, even when gradual pruning is considered, our method yields a gain in test accuracy over state-of-the-art approaches on standard image classification datasets such as CIFAR-10 and ImageNet. We also propose a variant called 'WoodTaylor', which takes into account the first-order gradient term and can lead to additional improvements. An important advantage of our methods is that they allow us to set the layer-wise pruning thresholds automatically, avoiding the need for manual tuning or sensitivity analysis.
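To make the mechanics concrete, below is a minimal NumPy sketch of the two ingredients mentioned above: the Woodbury (Sherman-Morrison) recursion that builds the inverse of a dampened empirical Fisher from per-example gradients, and the Optimal Brain Surgeon pruning statistic and weight correction computed from that inverse. This is an illustrative sketch, not the authors' reference implementation; the function names, the dampening constant damp, and the toy usage at the end are assumptions made here for clarity.

    import numpy as np

    def woodfisher_inverse(grads, damp=1e-5):
        # grads: (N, d) array of per-example gradients.
        # Returns an estimate of (damp * I + (1/N) * sum_i g_i g_i^T)^{-1},
        # built one rank-one Sherman-Morrison/Woodbury update at a time.
        n, d = grads.shape
        f_inv = np.eye(d) / damp                      # inverse of the damping term
        for g in grads:
            fg = f_inv @ g
            f_inv -= np.outer(fg, fg) / (n + g @ fg)  # rank-one Woodbury update
        return f_inv

    def obs_prune_one(w, f_inv):
        # Optimal Brain Surgeon step: pick the weight q with the smallest
        # statistic w_q^2 / (2 [F^{-1}]_{qq}) and apply the compensating
        # update delta_w = -w_q * F^{-1} e_q / [F^{-1}]_{qq}.
        diag = np.diag(f_inv)
        q = int(np.argmin(w ** 2 / (2.0 * diag)))
        w_new = w - w[q] * f_inv[:, q] / diag[q]      # drives w_new[q] to zero
        return q, w_new

    # Toy usage: random per-example gradients for a 10-parameter model.
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(32, 10))
    w = rng.normal(size=10)
    q, w_pruned = obs_prune_one(w, woodfisher_inverse(grads))

For networks with millions of parameters the full d x d inverse is of course intractable; in practice such an estimate would be maintained over small blocks of weights rather than the whole parameter vector, which is the kind of block-wise structure the paper exploits.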

[1] James Martens, et al. Deep learning via Hessian-free optimization. ICML, 2010.

[2] Tom Heskes, et al. On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation, 2000.

[3] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR, 2018.

[4] Jimmy Ba, et al. Kronecker-factored Curvature Approximations for Recurrent Neural Networks. ICLR, 2018.

[5] Nicol N. Schraudolph, et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 2002.

[6] Xin Dong, et al. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. NIPS, 2017.

[7] Elman Mansimov, et al. Second-order Optimization for Deep Reinforcement Learning using Kronecker-factored Approximation. NIPS, 2017.

[8] Pascal Vincent, et al. An Evaluation of Fisher Approximations Beyond Kronecker Factorization. ICLR, 2018.

[9] Suyog Gupta, et al. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR, 2017.

[10] Martin Jaggi, et al. Model Fusion via Optimal Transport. NeurIPS, 2019.

[11] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[12] Max Welling, et al. Learning Sparse Neural Networks through L0 Regularization. ICLR, 2017.

[13] Yurong Chen, et al. Dynamic Network Surgery for Efficient DNNs. NIPS, 2016.

[14] Frederik Kunstner, et al. Limitations of the empirical Fisher approximation for natural gradient descent. NeurIPS, 2019.

[15] Guigang Zhang, et al. Deep Learning. Int. J. Semantic Comput., 2016.

[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[17] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[18] Roger B. Grosse, et al. A Kronecker-factored approximate Fisher matrix for convolution layers. ICML, 2016.

[19] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. NIPS, 1992.

[20] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.

[21] Luke Zettlemoyer, et al. Sparse Networks from Scratch: Faster Training without Losing Performance. arXiv, 2019.

[22] Lucas Theis, et al. Faster gaze prediction with dense networks and Fisher pruning. arXiv, 2018.

[23] Dmitry P. Vetrov, et al. Variational Dropout Sparsifies Deep Neural Networks. ICML, 2017.

[24] Yixin Chen, et al. Compressing Neural Networks with the Hashing Trick. ICML, 2015.

[25] Sanja Fidler, et al. EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis. ICML, 2019.

[26] Hanan Samet, et al. Pruning Filters for Efficient ConvNets. ICLR, 2016.

[27] Raquel Urtasun, et al. MLPrune: Multi-Layer Pruning for Automated Neural Network Compression. 2018.

[28] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.

[29] Yi-Ming Chan, et al. Unifying and Merging Well-trained Deep Neural Networks for Inference Stage. IJCAI, 2018.

[30] Yann LeCun, et al. Optimal Brain Damage. NIPS, 1989.

[31] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 2011.

[32] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. ICML, 2015.

[33] Michael C. Mozer, et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. NIPS, 1988.

[34] Martin Jaggi, et al. Dynamic Model Pruning with Feedback. ICLR, 2020.

[35] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training. 2018.

[36] Roger B. Grosse, et al. Distributed Second-Order Optimization using Kronecker-Factored Approximations. ICLR, 2016.

[37] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. ICLR, 2015.

[38] Percy Liang, et al. Understanding Black-box Predictions via Influence Functions. ICML, 2017.

[39] Tao Zhang, et al. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv, 2017.

[40] Yann Dauphin, et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. ICLR, 2017.

[41] Olatunji Ruwase, et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. SC, 2019.

[42] Erich Elsen, et al. The State of Sparsity in Deep Neural Networks. arXiv, 2019.

[43] Jianxin Wu, et al. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. ICCV, 2017.

[44] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.

[45] Jürgen Schmidhuber, et al. Deep learning in neural networks: An overview. Neural Networks, 2014.

[46] James Martens, et al. New Insights and Perspectives on the Natural Gradient Method. J. Mach. Learn. Res., 2014.

[47] Rif A. Saurous, et al. Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks. ICLR, 2017.

[48] Satoshi Matsuoka, et al. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. CVPR, 2019.

[49] Naman Agarwal, et al. Second-Order Stochastic Optimization for Machine Learning in Linear Time. J. Mach. Learn. Res., 2016.

[50] Miguel Á. Carreira-Perpiñán, et al. "Learning-Compression" Algorithms for Neural Net Pruning. CVPR, 2018.

[51] Jian Sun, et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[52] Xin Wang, et al. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. ICML, 2019.

[53] Yurii Nesterov, et al. Cubic regularization of Newton method and its global performance. Math. Program., 2006.