SiPPing Neural Networks: Sensitivity-informed Provable Pruning of Neural Networks

We introduce a pruning algorithm that provably sparsifies the parameters of a trained model while approximately preserving the model's predictive accuracy. Our algorithm uses a small batch of input points to construct a data-informed importance sampling distribution over the network's parameters, and adaptively combines a sampling-based and a deterministic pruning procedure to discard redundant weights. Our pruning method is simultaneously computationally efficient, provably accurate, and broadly applicable to various network architectures and data distributions. Our empirical comparisons show that our algorithm reliably generates highly compressed networks that incur minimal loss in performance relative to the original network. We also present experimental results demonstrating our algorithm's potential to unearth essential network connections that can be trained successfully in isolation, which may be of independent interest.
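
As a rough illustration of the idea described above, the following NumPy sketch prunes a single dense layer using per-weight sensitivities estimated from a small batch, keeping part of the budget deterministically and filling the rest by importance sampling. The function names (`empirical_sensitivities`, `prune_layer`), the max-over-batch sensitivity definition, and the 50/50 split between the deterministic and sampled budgets are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def empirical_sensitivities(W, X, eps=1e-12):
    """Per-weight sensitivities for one dense layer (illustrative variant).

    For each weight W[i, j], take the maximum over the small batch X of its
    relative contribution |W[i, j] * x[j]| to the total magnitude
    sum_k |W[i, k] * x[k]| of output neuron i.
    """
    S = np.zeros_like(W)
    for x in X:                                 # x: one input activation vector
        contrib = np.abs(W * x)                 # |W[i, j] * x[j]| for every i, j
        row_total = contrib.sum(axis=1, keepdims=True) + eps
        S = np.maximum(S, contrib / row_total)  # worst case over the batch
    return S

def prune_layer(W, X, keep, rng=None):
    """Return a sparsified copy of W with roughly `keep` weights per output neuron.

    Half of the per-neuron budget is spent deterministically on the most
    sensitive weights; the remainder is filled by importance sampling
    proportional to sensitivity, with sampled entries reweighted so the
    pruned row matches the original one in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = empirical_sensitivities(W, X)
    W_pruned = np.zeros_like(W)
    n_out, n_in = W.shape
    for i in range(n_out):
        s = S[i]
        k_det = keep // 2
        top = np.argsort(s)[-k_det:] if k_det > 0 else np.array([], dtype=int)
        W_pruned[i, top] = W[i, top]                    # deterministic part
        rest = np.setdiff1d(np.arange(n_in), top)
        p = s[rest] + 1e-12
        p = p / p.sum()
        m = keep - k_det                                # sampling budget
        picks = rng.choice(len(rest), size=m, p=p, replace=True)
        for t in picks:
            j = rest[t]
            W_pruned[i, j] += W[i, j] / (m * p[t])      # unbiased reweighting
    return W_pruned

# Toy usage: prune a random 64x128 layer using a batch of 32 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((32, 128))
W_small = prune_layer(W, X, keep=16, rng=rng)
print("surviving weight fraction:", (W_small != 0).mean())
```

The reweighting by 1/(m * p) keeps the sampled portion of each row unbiased, which is the standard device behind sampling-based sparsification guarantees; the deterministic share covers the few weights whose sensitivity is too large to leave to chance.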
