Compression-aware Training of Deep Networks

In recent years, great progress has been made in a variety of application domains thanks to the development of increasingly deep neural networks. Unfortunately, the huge number of parameters in these networks makes them expensive both computationally and in terms of memory. To overcome this, several compression strategies have been proposed that exploit the fact that deep networks are over-parametrized. These methods, however, typically start from a network that has been trained in a standard manner, without accounting for the subsequent compression. In this paper, we propose to explicitly account for compression in the training process. To this end, we introduce a regularizer that encourages the parameter matrix of each layer to have low rank during training. We show that accounting for compression during training allows us to learn models that are much more compact, yet at least as effective, as those produced by state-of-the-art compression techniques.
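
The sketch below illustrates the general idea under one possible assumption: that the low-rank regularizer is a nuclear-norm penalty (the sum of singular values, a standard convex surrogate for matrix rank) applied to each layer's reshaped weight matrix and added to the task loss during training. The function names and hyperparameter values are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch (assumed formulation, not the paper's exact method):
# penalize the nuclear norm of every layer's weight matrix during training so
# that weights become approximately low rank and can later be compressed.
import torch
import torch.nn as nn

def nuclear_norm_penalty(model, strength=1e-4):
    """Sum of singular values over all linear and (flattened) conv weights."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            W = module.weight
            W2d = W.reshape(W.shape[0], -1)  # conv kernel -> (out, in * k * k)
            penalty = penalty + torch.linalg.svdvals(W2d).sum()
    return strength * penalty

# Usage inside a standard training step:
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = criterion(model(x), y) + nuclear_norm_penalty(model)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With a penalty of this kind, many singular values of each weight matrix are driven toward zero during training, so the trained weights can afterwards be truncated via SVD into low-rank factors with comparatively little loss in accuracy.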
