Variational Dropout and the Local Reparameterization Trick

We explore an as yet unexploited opportunity for drastically improving the efficiency of stochastic gradient variational Bayes (SGVB) with global model parameters. Regular SGVB estimators rely on sampling of parameters once per minibatch of data, and have variance that is constant w.r.t. the minibatch size. The efficiency of such estimators can be greatly improved by translating uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such reparameterizations with local noise can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. We find an important connection with regularization by dropout: the original Gaussian dropout objective corresponds to SGVB with local noise, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose \emph{variational dropout}, a generalization of Gaussian dropout with a more flexibly parameterized posterior, often leading to better generalization. The method is demonstrated through several experiments.
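To make the idea of local noise concrete, the following is a minimal NumPy sketch (not the authors' code) of the local reparameterization trick for a single fully connected layer, assuming a fully factorized Gaussian posterior q(w_ij) = N(mu_ij, sigma_ij^2) over the weights; the layer sizes, minibatch, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: minibatch of M inputs A with K features, D output units.
M, K, D = 128, 300, 100
A = rng.standard_normal((M, K))

# Variational posterior parameters for the K x D weight matrix.
mu = 0.01 * rng.standard_normal((K, D))
log_sigma2 = np.full((K, D), -6.0)     # log posterior variances
sigma2 = np.exp(log_sigma2)

# Naive SGVB: sample the weights once per minibatch. The same weight noise
# is shared by all M datapoints, so the gradient variance does not shrink
# as the minibatch grows.
W = mu + np.sqrt(sigma2) * rng.standard_normal((K, D))
B_naive = A @ W

# Local reparameterization: sample the pre-activations directly. Because
# q(W) is factorized Gaussian, B = A W is Gaussian with mean A mu and
# variance (A**2) sigma2 (elementwise squares), so we can draw noise that
# is independent across datapoints in the minibatch.
gamma = A @ mu                 # (M, D) pre-activation means
delta = (A ** 2) @ sigma2      # (M, D) pre-activation variances
B_local = gamma + np.sqrt(delta) * rng.standard_normal((M, D))
```

In this sketch, the Gaussian-dropout correspondence mentioned above amounts to tying the variances as sigma_ij^2 = alpha * mu_ij^2 with fixed alpha; variational dropout instead treats alpha as a learnable posterior parameter.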
