Scalable Neural Network Compression and Pruning Using Hard Clustering and L1 Regularization

We propose a simple, easy-to-implement neural network compression algorithm that achieves results competitive with more complicated state-of-the-art methods. The key idea is to modify the original optimization problem by adding K independent Gaussian priors (corresponding to the k-means objective) over the network parameters to achieve parameter quantization, along with an L1 penalty to achieve pruning. Unlike many existing quantization-based methods, ours uses hard clustering assignments of the network parameters, which requires only minimal changes to standard network training and adds little overhead. We also demonstrate experimentally that tying neural network parameters yields a smaller gain in generalization performance than changing the network architecture and connectivity patterns entirely.
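To make the modified objective concrete, the following is a minimal PyTorch sketch of the two penalty terms described above: a hard 1-D k-means step over the flattened parameters (the K Gaussian priors reduce to a sum of squared distances between each parameter and its assigned cluster center) plus an L1 term for pruning. The function names, hyperparameters (lam_quant, lam_l1), quantile-based initialization, and demo at the bottom are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of the quantization + L1 penalty; hyperparameters and the
# clustering schedule are assumptions, not the paper's exact recipe.
import torch

def kmeans_1d(weights, k, iters=10):
    """Hard 1-D k-means over a flat tensor of (detached) parameters.
    Returns the cluster centers and each weight's hard assignment."""
    # Initialize centers at evenly spaced quantiles of the weights (assumed).
    q = torch.linspace(0.0, 1.0, k, device=weights.device)
    centers = torch.quantile(weights, q)
    for _ in range(iters):
        # Hard assignment: index of the nearest center for every weight.
        assign = torch.argmin((weights[:, None] - centers[None, :]).abs(), dim=1)
        # Update each center to the mean of its assigned weights.
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = weights[mask].mean()
    return centers, assign

def compression_penalty(params, centers, assign, lam_quant, lam_l1):
    """Quantization term (sum of squared distances to assigned centers,
    i.e. the k-means objective) plus an L1 term for pruning."""
    quant = ((params - centers[assign]) ** 2).sum()
    sparsity = params.abs().sum()
    return lam_quant * quant + lam_l1 * sparsity

if __name__ == "__main__":
    # Tiny self-contained demo on a random parameter vector.
    torch.manual_seed(0)
    w = torch.randn(1000, requires_grad=True)
    centers, assign = kmeans_1d(w.detach(), k=8)  # hard clustering step
    penalty = compression_penalty(w, centers, assign,
                                  lam_quant=1e-3, lam_l1=1e-4)
    penalty.backward()  # gradients flow through the quantization + L1 terms
```

In a full training loop one would presumably re-run the hard clustering step periodically on the flattened network weights and add this penalty to the task loss; because assignments are hard, storing the compressed network only requires the cluster centers and one index per surviving weight.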
