Fast Convex Pruning of Deep Neural Networks