Latent Derivative Bayesian Last Layer Networks

Bayesian neural networks (BNNs) are powerful parametric models for nonlinear regression with uncertainty quantification. However, approximate inference techniques for weight-space priors suffer from several drawbacks. The ‘Bayesian last layer’ (BLL) is an alternative BNN approach that learns the feature space for an exact Bayesian linear model with explicit predictive distributions. However, its predictions outside of the data distribution (OOD) are typically overconfident, as the marginal likelihood objective leads to a learned feature space that overfits to the data. We overcome this weakness by introducing a functional prior on the model’s derivatives w.r.t. the inputs. Treating these Jacobians as latent variables, we incorporate the prior into the objective to influence the smoothness and diversity of the features, which enables greater predictive uncertainty. For the BLL, the Jacobians can be computed directly using forward-mode automatic differentiation, and the distribution over Jacobians can be obtained in closed form. We demonstrate that this method elevates the BLL to Gaussian-process-like performance on tasks where calibrated uncertainty is critical: OOD regression, Bayesian optimization, and active learning, including experiments on high-dimensional real-world datasets.
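
The mechanics described above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation, of the two ingredients: an exact Bayesian linear model over learned features with a closed-form predictive distribution, and feature Jacobians w.r.t. the inputs obtained via forward-mode automatic differentiation (here with torch.func.jacfwd, available in recent PyTorch). The network architecture, prior and noise variances, and toy data are illustrative assumptions, and the functional prior on the Jacobians itself is not implemented here.

```python
# Sketch only: Bayesian last layer (exact Bayesian linear regression on learned
# features) plus feature Jacobians via forward-mode AD. Names and hyperparameters
# (FeatureNet, prior_var, noise_var) are illustrative assumptions.
import torch
import torch.nn as nn
from torch.func import jacfwd, vmap


class FeatureNet(nn.Module):
    """Deterministic feature map phi(x) learned by the network body."""

    def __init__(self, in_dim, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.Tanh(),
            nn.Linear(128, feat_dim), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)


def bll_posterior(phi, y, prior_var=1.0, noise_var=0.1):
    """Exact posterior over last-layer weights for y = phi(x) w + eps."""
    d = phi.shape[1]
    precision = phi.T @ phi / noise_var + torch.eye(d) / prior_var
    cov = torch.linalg.inv(precision)
    mean = cov @ phi.T @ y / noise_var
    return mean, cov


def bll_predict(phi_star, mean, cov, noise_var=0.1):
    """Closed-form predictive mean and variance at test features."""
    mu = phi_star @ mean
    var = (phi_star @ cov * phi_star).sum(-1, keepdim=True) + noise_var
    return mu, var


feature_net = FeatureNet(in_dim=1)
x = torch.linspace(-3, 3, 50).unsqueeze(-1)          # toy inputs, shape (N, 1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)         # toy targets

# Feature Jacobians d phi / d x via forward-mode AD, batched over inputs.
# These are the quantities the functional prior acts on; shape (N, feat_dim, in_dim).
jac = vmap(jacfwd(feature_net))(x)

phi = feature_net(x)
mean, cov = bll_posterior(phi, y)
mu, var = bll_predict(phi, mean, cov)                # explicit predictive distribution
```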
