Predictive Uncertainty in Large Scale Classification using Dropout - Stochastic Gradient Hamiltonian Monte Carlo

Predictive uncertainty is crucial for many computer vision tasks, from image classification to autonomous driving systems. Hamiltonian Monte Carlo (HMC) is an inference method for sampling from complex posterior distributions. Dropout regularization, on the other hand, has been proposed as an approximate model-averaging technique that tends to improve generalization in large-scale models such as deep neural networks. Although HMC provides convergence guarantees for most standard Bayesian models, it does not handle the discrete parameters that arise from Dropout regularization. In this paper, we present a robust methodology for predictive uncertainty in large-scale classification problems, based on Dropout and Stochastic Gradient Hamiltonian Monte Carlo. Even though Dropout induces a non-smooth energy function with no such convergence guarantees, the resulting discretization of the Hamiltonian proves empirically successful. The proposed method allows us to effectively estimate predictive uncertainty and provides better generalization for difficult test examples.
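The combination described above can be sketched as follows: a standard SGHMC update of the parameters, with a Bernoulli dropout mask applied when the stochastic gradient is evaluated, and predictive uncertainty estimated by averaging over the posterior samples. This is a minimal illustration on a toy logistic-regression problem, not the authors' exact algorithm; the data, model, step size, friction, and dropout rate are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (a hypothetical stand-in for a
# large-scale task; sizes and noise level are assumptions).
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_U(w, Xb, yb, n_total, prior_var=10.0):
    # Stochastic gradient of the negative log posterior: minibatch
    # logistic-likelihood gradient rescaled to the full data set,
    # plus a Gaussian prior term.
    p = sigmoid(Xb @ w)
    return Xb.T @ (p - yb) * (n_total / len(yb)) + w / prior_var

def sghmc_dropout(X, y, n_keep=500, n_burn=1000, eps=1e-4, alpha=0.05,
                  keep_prob=0.8, batch_size=50):
    # SGHMC with momentum v and friction alpha; a Bernoulli dropout
    # mask is applied to the parameters each time the stochastic
    # gradient is computed, making the energy function non-smooth.
    d = X.shape[1]
    w, v = np.zeros(d), np.zeros(d)
    samples = []
    for t in range(n_burn + n_keep):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        mask = (rng.random(d) < keep_prob) / keep_prob  # inverted dropout
        g = grad_U(w * mask, X[idx], y[idx], len(X)) * mask
        v = v - eps * g - alpha * v + rng.normal(size=d) * np.sqrt(2 * alpha * eps)
        w = w + v
        if t >= n_burn:  # discard burn-in
            samples.append(w.copy())
    return np.array(samples)

samples = sghmc_dropout(X, y)
probs = sigmoid(X @ samples.T)   # (n_points, n_samples)
pred_mean = probs.mean(axis=1)   # Monte Carlo predictive probability
pred_std = probs.std(axis=1)     # per-example predictive uncertainty
```

Averaging the predictive probabilities over the sampled parameters yields both a point prediction and a spread, so hard test examples can be flagged by their large `pred_std`.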
