Predictive Uncertainty Quantification with Compound Density Networks

Despite the huge success of deep neural networks (NNs), finding good mechanisms for quantifying their prediction uncertainty remains an open problem. Bayesian neural networks are one of the most popular approaches to uncertainty quantification. More recently, it was shown that ensembles of NNs, which belong to the class of mixture models, can also be used to quantify prediction uncertainty. In this paper, we build upon these two approaches. First, we increase the mixture model's flexibility by replacing the fixed mixing weights with an adaptive, input-dependent distribution (specifying the probability of each component) represented by NNs, and by considering uncountably many mixture components. The resulting class of models can be seen as the continuous counterpart to mixture density networks and is therefore referred to as compound density networks (CDNs). We employ both maximum likelihood and variational Bayesian inference to train CDNs, and empirically show that they yield better uncertainty estimates on out-of-distribution data and are more robust to adversarial examples than previous approaches.
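To make the construction concrete, here is a minimal sketch of the two predictive distributions being contrasted; the notation $p_k(y \mid x)$ for the component densities, $\pi_k(x)$ for the mixing weights, and $p(\theta \mid x; \phi)$ for the input-dependent mixing distribution is illustrative and not taken verbatim from the paper. A mixture density network combines finitely many components with input-dependent mixing weights,

$$ p(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\, p_k(y \mid x), \qquad \sum_{k=1}^{K} \pi_k(x) = 1, $$

whereas a compound density network replaces the finite sum by an integral over an uncountable family of components, mixed by a distribution that itself depends on the input:

$$ p(y \mid x) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid x;\, \phi)\, \mathrm{d}\theta. $$

In practice this integral is intractable and would presumably be approximated, for example by Monte Carlo samples drawn from $p(\theta \mid x; \phi)$.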
