Dynamical Isometry: The Missing Ingredient for Neural Network Pruning

Several recent works [40, 24] observed an interesting phenomenon in neural network pruning: a larger finetuning learning rate can significantly improve the final performance. The reason behind this, however, has remained elusive to date. This paper explains it through the lens of dynamical isometry [42]. Specifically, we examine neural network pruning from an unusual perspective, namely pruning as initialization for finetuning, and ask whether the inherited weights actually serve as a good initialization for finetuning. The insights from dynamical isometry suggest a negative answer. Despite its critical role, this issue has not been well recognized by the community. In this paper, we show that understanding this problem matters: beyond explaining the aforementioned mystery of the larger finetuning learning rate, it also sheds light on the debated value of pruning [5, 30]. Besides a clearer theoretical understanding of pruning, resolving this problem brings considerable performance benefits in practice.
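To make the dynamical-isometry lens concrete, the sketch below (illustrative only, not taken from the paper; the depth, width, and 50% pruning ratio are arbitrary choices) uses numpy to compare the singular-value spectrum of a deep linear network's input-output Jacobian before and after magnitude pruning: a stack of random orthogonal layers is exactly isometric, while pruning those same layers, a stand-in for the weights inherited by finetuning, pushes the spectrum away from 1.

# A minimal sketch (not from the paper): dynamical isometry asks that the singular
# values of the network's input-output Jacobian stay close to 1. For a deep linear
# network the Jacobian is simply the product of the weight matrices, so we can probe
# how magnitude pruning perturbs the spectrum of an initially isometric network.
# The width, depth, and 50% pruning ratio below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 10, 256

def random_orthogonal(n, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the result is Haar-distributed

def jacobian_singular_values(weights):
    # Input-output Jacobian of a deep linear net: W_L @ ... @ W_1.
    jac = np.eye(width)
    for w in weights:
        jac = w @ jac
    return np.linalg.svd(jac, compute_uv=False)

layers = [random_orthogonal(width, rng) for _ in range(depth)]
sv = jacobian_singular_values(layers)
print(f"orthogonal init: max sv = {sv[0]:.3f}, min sv = {sv[-1]:.3f}")  # both ~1.0

# Magnitude-prune 50% of each layer's weights (a stand-in for inherited weights).
pruned = [w * (np.abs(w) > np.quantile(np.abs(w), 0.5)) for w in layers]
sv_p = jacobian_singular_values(pruned)
print(f"after pruning  : max sv = {sv_p[0]:.3f}, min sv = {sv_p[-1]:.3f}")  # spectrum spreads away from 1

In a nonlinear network the same diagnostic applies to the Jacobian evaluated at a given input rather than to a fixed matrix product, but the deep linear case already shows how pruning can break the isometry of the inherited weights.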

[1] Roger B. Grosse et al. Picking Winning Tickets Before Training by Preserving Gradient Flow. ICLR, 2020.

[2] Ali Farhadi et al. What’s Hidden in a Randomly Weighted Neural Network? CVPR, 2020.

[3] Gintare Karolina Dziugaite et al. Linear Mode Connectivity and the Lottery Ticket Hypothesis. ICML, 2019.

[4] Yoshua Bengio et al. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.

[5] Jiri Matas et al. All you need is a good init. ICLR, 2015.

[6] H. Robbins. A Stochastic Approximation Method. 1951.

[7] Jascha Sohl-Dickstein et al. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. ICML, 2018.

[8] Dan Alistarh et al. WoodFisher: Efficient second-order approximations for model compression. arXiv, 2020.

[9] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[10] Xiangyu Zhang et al. Channel Pruning for Accelerating Very Deep Neural Networks. ICCV, 2017.

[11] Philipp Birken et al. Numerical Linear Algebra. Encyclopedia of Parallel Computing, 2011.

[12] Vivienne Sze et al. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 2017.

[13] Yu Wang et al. Exploring the Granularity of Sparsity in Convolutional Neural Networks. CVPR Workshops, 2017.

[14] Hanqing Lu et al. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of Information Technology & Electronic Engineering, 2018.

[15] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[16] F. Mezzadri. How to generate random matrices from the classical compact groups. arXiv:math-ph/0609050, 2006.

[17] Huan Wang et al. Structured Pruning for Efficient ConvNets via Incremental Regularization. IJCNN, 2019.

[18] Philip H. S. Torr et al. SNIP: Single-shot Network Pruning based on Connection Sensitivity. ICLR, 2018.

[19] Michael C. Mozer et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. NIPS, 1988.

[20] Yulun Zhang et al. Neural Pruning via Growing Regularization. ICLR, 2020.

[21] Michael Carbin et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR, 2018.

[22] Michael Carbin et al. Comparing Rewinding and Fine-tuning in Neural Network Pruning. ICLR, 2019.

[23] Binh-Son Hua et al. Network Pruning That Matters: A Case Study on Retraining Variants. ICLR, 2021.

[24] Frank Hutter et al. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR, 2016.

[25] Tao Zhang et al. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Processing Magazine, 2018.

[26] Leslie N. Smith et al. Cyclical Learning Rates for Training Neural Networks. WACV, 2017.

[27] Léon Bottou et al. Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 2010.

[28] Jianxin Wu et al. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. ICCV, 2017.

[29] Geoffrey E. Hinton et al. On the importance of initialization and momentum in deep learning. ICML, 2013.

[30] Amos Storkey et al. A Closer Look at Structured Pruning for Neural Network Compression. 2018.

[31] Zhiqiang Shen et al. Learning Efficient Convolutional Networks through Network Slimming. ICCV, 2017.

[32] Lucas Theis et al. Faster gaze prediction with dense networks and Fisher pruning. arXiv, 2018.

[33] Trevor Darrell et al. Data-dependent Initializations of Convolutional Neural Networks. ICLR, 2015.

[34] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.

[35] Erich Elsen et al. The State of Sparsity in Deep Neural Networks. arXiv, 2019.

[36] Yiran Chen et al. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.

[37] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.

[38] Ehud D. Karnin et al. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1990.

[39] Surya Ganguli et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2013.

[40] Naiyan Wang et al. Data-Driven Sparse Structure Selection for Deep Neural Networks. ECCV, 2017.

[41] Song Han et al. Learning both Weights and Connections for Efficient Neural Network. NIPS, 2015.

[42] Yann LeCun et al. Optimal Brain Damage. NIPS, 1989.

[43] Philip H. S. Torr et al. A Signal Propagation Perspective for Pruning Neural Networks at Initialization. ICLR, 2019.

[44] Y. Fu et al. Emerging Paradigms of Neural Network Pruning. arXiv, 2021.

[45] Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML, 2010.

[46] Hanan Samet et al. Pruning Filters for Efficient ConvNets. ICLR, 2016.

[47] Sanja Fidler et al. EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis. ICML, 2019.

[48] Song Han et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. ICLR, 2015.

[49] Gintare Karolina Dziugaite et al. Pruning Neural Networks at Initialization: Why are We Missing the Mark? arXiv, 2020.

[50] Jose Javier Gonzalez Ortiz et al. What is the State of Neural Network Pruning? MLSys, 2020.

[51] Babak Hassibi et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. NIPS, 1992.

[52] Russell Reed et al. Pruning algorithms-a survey. IEEE Transactions on Neural Networks, 1993.

[53] Mingjie Sun et al. Rethinking the Value of Network Pruning. ICLR, 2018.

[54] Sheng Tang et al. Auto-Balanced Filter Pruning for Efficient Convolutional Neural Networks. AAAI, 2018.

[55] Yuan Xie et al. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. Proceedings of the IEEE, 2020.