Revisiting Loss Modelling for Unstructured Pruning

By removing parameters from deep neural networks, unstructured pruning methods aim to cut memory footprint and computational cost while maintaining prediction accuracy. To tackle this otherwise intractable problem, many of these methods model the loss landscape with first- or second-order Taylor expansions to identify which parameters can be discarded. We revisit loss modelling for unstructured pruning and show the importance of ensuring the locality of the pruning steps. We systematically compare first- and second-order Taylor expansions and empirically show that both can reach similar levels of performance. Finally, we show that better preserving the original network function does not necessarily transfer to better-performing networks after fine-tuning, suggesting that considering only the impact of pruning on the loss may not be a sufficient objective for designing good pruning criteria.
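To make the loss-modelling idea concrete, here is a minimal worked form in the spirit of classical Taylor-expansion pruning criteria (Optimal Brain Damage and its descendants); the exact criteria evaluated in the paper may differ in their curvature approximations. Around the current weights $w$, the change in loss induced by a perturbation $\delta w$ is modelled as

$$\Delta\mathcal{L}(\delta w) \;\approx\; \nabla\mathcal{L}(w)^{\top}\delta w \;+\; \tfrac{1}{2}\,\delta w^{\top} H\,\delta w,$$

where $H$ is the Hessian of the loss (or a tractable surrogate such as the Fisher information matrix). Removing weight $w_i$ corresponds to the perturbation $\delta w = -w_i e_i$, which yields the first-order saliency $|g_i w_i|$ and, under a diagonal-curvature assumption, the second-order saliency $\tfrac{1}{2} H_{ii} w_i^2$; the lowest-saliency weights are pruned first. Ensuring locality then amounts to keeping each pruning step's $\delta w$ small enough that this expansion remains a faithful model of the true change in loss.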
