Alpha-Divergences in Variational Dropout

We investigate the use of alternatives to the Kullback-Leibler (KL) divergence in variational inference (VI), building on Variational Dropout \cite{kingma2015}. Stochastic gradient variational Bayes (SGVB) \cite{aevb} is a general framework for estimating the evidence lower bound (ELBO) in variational Bayes. In this work, we extend the SGVB estimator to use alpha-divergences, a family of alternatives to VI's standard KL objective. Gaussian dropout can be seen as applying the local reparameterization trick to the SGVB objective, and we extend Variational Dropout accordingly to perform VI with alpha-divergences. Our experiments compare $\alpha$-divergence variational dropout against standard variational dropout with both correlated and uncorrelated weight noise. We find that the $\alpha$-divergence with $\alpha \rightarrow 1$ (i.e., the KL divergence) remains a good choice for variational inference, despite the reported effectiveness of alpha-divergences for dropout VI \cite{Li17}: $\alpha \rightarrow 1$ can yield the lowest training error and optimizes a good lower bound on the model evidence among all values of the parameter $\alpha \in [0,\infty)$.
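For reference, the objective family in question is the variational Rényi bound of [11], which [8] applies to dropout networks. Writing $q(\theta)$ for the approximate posterior and $\mathcal{D}$ for the data,
\[
D_\alpha(q \,\|\, p) = \frac{1}{\alpha - 1} \log \int q(\theta)^{\alpha}\, p(\theta)^{1-\alpha}\, d\theta,
\qquad
\mathcal{L}_\alpha = \frac{1}{1-\alpha} \log \mathbb{E}_{q}\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right],
\]
and $\mathcal{L}_\alpha$ recovers the standard ELBO, $\mathbb{E}_q\left[\log p(\theta, \mathcal{D}) - \log q(\theta)\right]$, in the limit $\alpha \rightarrow 1$.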
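To make the estimator concrete, the following is a minimal NumPy sketch of a Monte Carlo estimate of $\mathcal{L}_\alpha$ on a toy conjugate Gaussian model (the toy model, function names, and settings are our illustrative choices, not the networks from our experiments):
\begin{verbatim}
import numpy as np

def log_gauss(x, mu, sigma):
    """Elementwise log-density of N(mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma)**2

def renyi_bound(data, q_mu, q_sigma, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of L_alpha for a toy model:
    theta ~ N(0, 1), data_i ~ N(theta, 1), q(theta) = N(q_mu, q_sigma^2)."""
    rng = np.random.default_rng(seed)
    # Reparameterized samples from q: theta = mu + sigma * eps.
    theta = q_mu + q_sigma * rng.standard_normal(n_samples)
    log_joint = (log_gauss(theta, 0.0, 1.0)                       # log prior
                 + np.array([log_gauss(data, t, 1.0).sum() for t in theta]))
    log_w = log_joint - log_gauss(theta, q_mu, q_sigma)           # log p/q
    if np.isclose(alpha, 1.0):
        return log_w.mean()                                       # ELBO (alpha -> 1)
    a = (1.0 - alpha) * log_w
    # Stable log-mean-exp, then rescale by 1/(1 - alpha).
    return (a.max() + np.log(np.exp(a - a.max()).mean())) / (1.0 - alpha)

data = np.random.default_rng(1).normal(0.5, 1.0, size=50)
for alpha in [0.0, 0.5, 1.0, 2.0]:
    print(alpha, renyi_bound(data, q_mu=0.4, q_sigma=0.15, alpha=alpha))
\end{verbatim}
As $\alpha \rightarrow 0^{+}$ the estimate approaches an importance-sampled estimate of the log evidence, while $\alpha \rightarrow 1$ gives the ELBO; intermediate values interpolate between the two.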

[1] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[2] Alexandre Lacoste, et al. Bayesian Hypernetworks, 2017, arXiv.

[3] Julien Cornebise, et al. Weight Uncertainty in Neural Networks, 2015, arXiv.

[4] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[5] Ryan P. Adams, et al. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks, 2015, ICML.

[6] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[7] Christopher D. Manning, et al. Fast dropout training, 2013, ICML.

[8] Yarin Gal, et al. Dropout Inference in Bayesian Neural Networks with Alpha-divergences, 2017, ICML.

[9] Jianhua Lin, et al. Divergence measures based on the Shannon entropy, 1991, IEEE Trans. Inf. Theory.

[10] Peter Harremoës, et al. Rényi Divergence and Kullback-Leibler Divergence, 2012, IEEE Trans. Inf. Theory.

[11] Richard E. Turner, et al. Rényi Divergence Variational Inference, 2016, NIPS.

[12] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine Learning.

[13] Kevin P. Murphy, et al. Machine Learning: A Probabilistic Perspective, 2012, Adaptive Computation and Machine Learning series.

[14] Diederik P. Kingma, et al. Variational Dropout and the Local Reparameterization Trick, 2015, NIPS.

[15] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[16] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[17] Zoubin Ghahramani, et al. Variational Gaussian Dropout is not Bayesian, 2017, arXiv:1711.02989.

[18] Richard E. Turner, et al. Black-box α-divergence minimization, 2016, ICML.

[19] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.

[20] Zoubin Ghahramani, et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[21] Zoubin Ghahramani, et al. Deep Bayesian Active Learning with Image Data, 2017, ICML.