Variational Bayesian Dropout With a Hierarchical Prior

Variational dropout (VD) generalizes Gaussian dropout by inferring a posterior over the network weights under a log-uniform prior, so that the weights and the dropout rates are learned simultaneously. The log-uniform prior both explains the regularization effect of Gaussian dropout in network training and underpins the posterior inference. However, it is an improper prior (i.e., its integral is infinite), which renders the posterior inference ill-posed and limits the regularization performance of VD. To address this problem, we present a new generalization of Gaussian dropout, termed variational Bayesian dropout (VBD), which instead places a hierarchical prior on the network weights and infers a new joint posterior. Specifically, we implement the hierarchical prior as a zero-mean Gaussian distribution whose variance is drawn from a uniform hyper-prior. Incorporating this prior, we infer the joint posterior over the network weights and the prior variance, which casts network training and dropout-rate estimation as a single joint optimization problem. Crucially, the hierarchical prior is a proper prior, so the posterior inference is well-posed. We further show that the proposed VBD can be seamlessly applied to network compression. Experiments on classification and network compression demonstrate the superior performance of VBD in regularizing network training.
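
To make the construction concrete, the hierarchical prior and the joint-inference objective described above can be sketched as follows; the symbol s_{ij} for the per-weight prior variance and the bound c on the uniform hyper-prior are our own notation, not necessarily that of the paper:

    p(w_{ij} \mid s_{ij}) = \mathcal{N}(w_{ij} \mid 0,\, s_{ij}), \qquad s_{ij} \sim \mathcal{U}(0, c),

    \mathcal{L}(q) = \mathbb{E}_{q(w, s)}\big[\log p(\mathcal{D} \mid w)\big] \;-\; \mathrm{KL}\big(q(w, s)\,\big\|\,p(w, s)\big).

Because \mathcal{U}(0, c) integrates to one, the joint prior p(w, s) is proper, and maximizing \mathcal{L}(q) over the joint posterior q(w, s) trains the weights and estimates the dropout rates in one optimization.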

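As a runnable illustration of this joint optimization, below is a minimal PyTorch sketch of a VBD-style linear layer. It assumes a Gaussian posterior q(w_ij) = N(theta_ij, alpha_ij * theta_ij^2) with a learned per-weight dropout rate alpha_ij, uses the local reparameterization trick, and takes the KL term to be 0.5 * log(1 + 1/alpha), the closed form obtained when the prior variance is optimized pointwise; the class name, initialization constants, and this exact KL form are our own choices for illustration, not details confirmed by the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VBDLinear(nn.Module):
    """Sketch of a linear layer trained with variational Bayesian dropout.

    Posterior over each weight: q(w_ij) = N(theta_ij, alpha_ij * theta_ij^2),
    so alpha_ij acts as a per-weight Gaussian dropout rate learned jointly
    with theta_ij.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        # log(alpha), initialized so the dropout rate starts small.
        self.log_alpha = nn.Parameter(
            torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        # Local reparameterization: sample pre-activations instead of
        # weights, which keeps the estimator unbiased while reducing
        # gradient variance.
        mean = F.linear(x, self.theta)
        var = F.linear(x.pow(2), self.log_alpha.exp() * self.theta.pow(2))
        return mean + (var + 1e-8).sqrt() * torch.randn_like(mean)

    def kl(self):
        # Per-weight KL against the hierarchical prior with its variance
        # optimized in closed form: 0.5 * log(1 + 1/alpha), written here
        # via softplus(-log_alpha). Unlike the log-uniform case, this
        # quantity is exact and finite (the inference is well-posed).
        return 0.5 * F.softplus(-self.log_alpha).sum()

# ELBO-style objective for N training examples:
# loss = F.cross_entropy(model(x), y) \
#        + sum(m.kl() for m in model.modules() if isinstance(m, VBDLinear)) / N
#
# For network compression, weights whose learned dropout rate is large could
# then be pruned, e.g. by masking entries with layer.log_alpha above a
# threshold (a hypothetical criterion, shown only to indicate the idea).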