Stationary Activations for Uncertainty Calibration in Deep Learning

We introduce a new family of non-linear neural network activation functions that mimic the properties induced by the widely used Matérn family of kernels in Gaussian process (GP) models. This class spans a range of locally stationary models with varying degrees of mean-square differentiability. We show an explicit link to the corresponding GP models in the case where the network consists of a single infinitely wide hidden layer. In the limit of infinite smoothness the Matérn family converges to the RBF kernel, and in this case we recover RBF activations. Matérn activation functions inherit the appealing properties of their counterparts in GP models, and we demonstrate that local stationarity combined with limited mean-square differentiability yields both good predictive performance and well-calibrated uncertainty in Bayesian deep learning tasks. In particular, local stationarity helps calibrate out-of-distribution (OOD) uncertainty. We demonstrate these properties on classification and regression benchmarks and on a radar emitter classification task.
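For readers less familiar with the kernel family referenced above, the standard isotropic Matérn covariance and its infinite-smoothness limit are sketched below. This is textbook background (cf. Rasmussen and Williams, Gaussian Processes for Machine Learning), not the paper's activation-function construction itself; here $\sigma^2$, $\ell$, and $\nu$ denote the usual magnitude, length-scale, and smoothness parameters, $r = \lVert \mathbf{x} - \mathbf{x}' \rVert$, and $K_\nu$ is the modified Bessel function of the second kind:

  k_\nu(r) = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right),

with the familiar half-integer special cases

  k_{1/2}(r) = \sigma^2 \exp\!\left( -\frac{r}{\ell} \right), \qquad k_{3/2}(r) = \sigma^2 \left( 1 + \frac{\sqrt{3}\, r}{\ell} \right) \exp\!\left( -\frac{\sqrt{3}\, r}{\ell} \right),

and, in the limit of infinite smoothness, the RBF (squared-exponential) kernel mentioned above:

  \lim_{\nu \to \infty} k_\nu(r) = \sigma^2 \exp\!\left( -\frac{r^2}{2\ell^2} \right).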
