Pruning neural networks without any data by iteratively conserving synaptic flow

Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory, and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory-driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer that renders a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm, Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization, subject to a sparsity constraint. Notably, it makes no reference to the training data and consistently competes with or outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.99 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that, at initialization, data must be used to quantify which synapses are important.
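
To make the data-free procedure concrete, the sketch below illustrates the idea in PyTorch: score each weight by |dR/dw * w|, where R is the summed output of the network with all parameters replaced by their absolute values, evaluated on an all-ones input, and prune iteratively along an exponential sparsity schedule, re-scoring after each step. This is a minimal sketch under assumptions, not the authors' released implementation: the function names (synflow_scores, synflow_prune), the choice to prune only tensors with more than one dimension, and the application of masks by zeroing weights in place are simplifications introduced here for illustration.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def _linearize(model: nn.Module) -> dict:
    """Replace every tensor in the model with its absolute value; return the signs."""
    signs = {}
    for name, tensor in model.state_dict().items():
        signs[name] = torch.sign(tensor)
        tensor.abs_()
    return signs


@torch.no_grad()
def _restore(model: nn.Module, signs: dict) -> None:
    """Undo _linearize by multiplying the stored signs back in."""
    for name, tensor in model.state_dict().items():
        tensor.mul_(signs[name])


def synflow_scores(model: nn.Module, input_shape: tuple) -> dict:
    """Score each parameter by |dR/dw * w|, where R is the summed output of the
    |w|-network on an all-ones input -- no training data or labels are used."""
    # Note: model.eval() beforehand avoids BatchNorm updating running stats on the
    # probe pass; double precision helps avoid overflow at extreme sparsities.
    signs = _linearize(model)
    probe = torch.ones(1, *input_shape)           # data-free probe input
    model.zero_grad()
    model(probe).sum().backward()                 # R = 1^T f_{|w|}(1)
    scores = {name: (p.grad * p).abs().detach().clone()
              for name, p in model.named_parameters() if p.grad is not None}
    _restore(model, signs)
    model.zero_grad()
    return scores


def synflow_prune(model: nn.Module, input_shape: tuple,
                  compression: float = 100.0, iterations: int = 100) -> nn.Module:
    """Iteratively prune weight tensors down to 1/compression of their original
    count, following an exponential sparsity schedule and re-scoring each step."""
    weights = {n: p for n, p in model.named_parameters() if p.dim() > 1}
    total = sum(p.numel() for p in weights.values())
    for step in range(1, iterations + 1):
        keep_frac = compression ** (-step / iterations)   # fraction kept after this step
        scores = synflow_scores(model, input_shape)
        flat = torch.cat([scores[n].flatten() for n in weights])
        num_prune = int((1.0 - keep_frac) * total)
        if num_prune == 0:
            continue
        threshold = torch.kthvalue(flat, num_prune).values
        with torch.no_grad():
            for n, p in weights.items():
                # Zero out low-scoring weights; already-pruned weights score 0 and stay pruned.
                p.mul_((scores[n] > threshold).to(p.dtype))
    return model
```

Re-scoring after every pruning step, rather than pruning in a single shot, is what allows the procedure to respect the conservation of synaptic flow and avoid layer-collapse even at extreme compression ratios.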
