Learning Sparse Neural Networks through $L_0$ Regularization

We propose a practical method for $L_0$ norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. AIC and BIC, well-known model selection criteria, are special cases of $L_0$ regularization. However, since the $L_0$ norm of weights is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, for certain distributions over the gates, the expected $L_0$ norm of the resulting gated weights is differentiable with respect to the distribution parameters. We further propose the \emph{hard concrete} distribution for the gates, which is obtained by "stretching" a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result, our method allows for straightforward and efficient learning of model structures with stochastic gradient descent, and enables conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.
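To make the construction above more tangible, the sketch below samples hard concrete gates and computes the expected $L_0$ penalty. It is a minimal illustration in PyTorch under stated assumptions: the stretch interval $(\gamma, \zeta) = (-0.1, 1.1)$ and temperature $\beta = 2/3$ are example hyperparameter values, and all names (HardConcreteGate, log_alpha, expected_l0) are ours rather than taken from a reference implementation.

```python
# A minimal sketch of hard concrete gates and the expected L0 penalty.
# The hyperparameters (gamma=-0.1, zeta=1.1, beta=2/3) and every name below
# are illustrative assumptions, not the authors' reference implementation.
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    def __init__(self, num_gates, gamma=-0.1, zeta=1.1, beta=2.0 / 3.0):
        super().__init__()
        # Per-gate location parameter of the underlying binary concrete distribution.
        self.log_alpha = nn.Parameter(torch.zeros(num_gates))
        self.gamma, self.zeta, self.beta = gamma, zeta, beta

    def forward(self):
        if self.training:
            # Reparameterized sample from the binary concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        else:
            # Deterministic gate value at test time.
            s = torch.sigmoid(self.log_alpha)
        # "Stretch" to (gamma, zeta), then clip with a hard-sigmoid so that
        # exact zeros (and exact ones) occur with non-zero probability.
        return torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)

    def expected_l0(self):
        # Probability that each gate is non-zero; this quantity is differentiable
        # in log_alpha, so it can serve as the expected L0 penalty in the objective.
        shift = self.beta * math.log(-self.gamma / self.zeta)
        return torch.sigmoid(self.log_alpha - shift).sum()
```

In a training loop one would multiply each weight (or group of weights) by the corresponding sampled gate and add a term such as lambda_reg * gate.expected_l0() to the loss, so that the gate parameters and the network weights are updated jointly by stochastic gradient descent; at test time the deterministic gates prune any weight whose clipped gate value is exactly zero.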
