Variational Dropout Sparsifies Deep Neural Networks

We explore the recently proposed Variational Dropout technique, which provides an elegant Bayesian interpretation of Gaussian Dropout. We extend Variational Dropout to the case where the dropout rates are unbounded, propose a way to reduce the variance of the gradient estimator, and report the first experimental results with individual dropout rates per weight. Interestingly, this leads to extremely sparse solutions in both fully connected and convolutional layers. The effect is similar to the automatic relevance determination effect in empirical Bayes, but has a number of advantages. We reduce the number of parameters up to 280 times on LeNet architectures and up to 68 times on VGG-like networks with a negligible decrease in accuracy.
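
To make the idea of individual dropout rates per weight concrete, below is a minimal sketch of a fully connected layer in which every weight carries its own learnable variance, so that its effective dropout rate alpha = sigma^2 / w^2 can grow without bound and the weight can be pruned. PyTorch, the initialization values, the log-alpha pruning threshold of 3, and the polynomial KL approximation constants are assumptions made for illustration; they are not specified in the abstract above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearWithPerWeightDropout(nn.Module):
    """Fully connected layer with an individual dropout rate per weight.

    Each weight w_ij has its own variance sigma_ij^2; the effective dropout
    rate is alpha_ij = sigma_ij^2 / w_ij^2.  Weights whose log alpha exceeds
    a threshold are treated as irrelevant and masked out at test time.
    """

    def __init__(self, in_features, out_features, log_alpha_threshold=3.0):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # log sigma^2, one value per weight (individual dropout rates).
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.log_alpha_threshold = log_alpha_threshold

    def log_alpha(self):
        # log alpha = log sigma^2 - log w^2, clipped for numerical stability.
        return torch.clamp(
            self.log_sigma2 - torch.log(self.weight ** 2 + 1e-8), -10.0, 10.0
        )

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample pre-activations instead of
            # weights, which reduces the variance of the gradient estimator.
            mean = F.linear(x, self.weight, self.bias)
            var = F.linear(x ** 2, torch.exp(self.log_sigma2))
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        # At test time keep only the weights with low dropout rates.
        mask = (self.log_alpha() < self.log_alpha_threshold).float()
        return F.linear(x, self.weight * mask, self.bias)

    def kl(self):
        # Polynomial approximation of KL(q || log-uniform prior); the
        # constants below are an assumed fit, added to the data loss
        # during training.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha()
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

In a setup like this, one would minimize the data loss plus the sum of `kl()` terms over all such layers; the fraction of weights whose log alpha ends up above the threshold then measures the achieved sparsity.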
