Can You Trust This Prediction? Auditing Pointwise Reliability After Learning

To use machine learning in high-stakes applications (e.g., medicine), we need tools for building confidence in a system and evaluating whether it is reliable. Methods that improve model reliability often require new learning algorithms (e.g., Bayesian inference to obtain uncertainty estimates). An alternative is to audit a model after it is trained. In this paper, we describe resampling uncertainty estimation (RUE), an algorithm for auditing the pointwise reliability of predictions. Intuitively, RUE estimates the amount by which a prediction would change if the model had been fit on different training data. The algorithm uses the gradient and Hessian of the model's loss function to create an ensemble of predictions. Experimentally, we show that RUE detects inaccurate predictions more effectively than existing tools for auditing reliability after training. We also show that RUE can produce predictive distributions competitive with state-of-the-art methods such as Monte Carlo dropout, probabilistic backpropagation, and deep ensembles, without depending on a specific training-time algorithm as those methods do.
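The core idea — approximating how a fitted model would change under resampled training data, using only the gradient and Hessian of the loss — can be sketched in a toy setting. The snippet below is a minimal illustration for linear least squares, where both quantities are available in closed form; it is not the paper's implementation, and all names (`resampled_predictions`, `theta_hat`, etc.) are illustrative. Each bootstrap reweighting of the training set is mapped to an approximate refit via one Newton step from the original parameters, and the spread of the resulting prediction ensemble serves as a pointwise reliability score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: y = 2x + noise, with x confined to [-1, 1].
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])       # design matrix with intercept
y = 2.0 * x + rng.normal(scale=0.3, size=n)

# Original fit, residuals, and Hessian of the total squared loss.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ theta_hat
H = X.T @ X

def resampled_predictions(x_new, B=500):
    """Ensemble of predictions at x_new under bootstrap reweightings.

    For bootstrap weights w, the refit model is approximated by a single
    Newton step from theta_hat:
        theta_b ~= theta_hat - H^{-1} * sum_i (w_i - 1) * grad_i(theta_hat)
    The unweighted gradient vanishes at theta_hat, so only the (w_i - 1)
    terms contribute.
    """
    x_row = np.array([1.0, x_new])
    preds = np.empty(B)
    for b in range(B):
        w = rng.multinomial(n, np.full(n, 1.0 / n))   # bootstrap weights
        g = -X.T @ ((w - 1) * resid)                  # reweighted gradient
        theta_b = theta_hat - np.linalg.solve(H, g)   # one Newton step
        preds[b] = x_row @ theta_b
    return preds

# Uncertainty should grow away from the training data.
std_near = resampled_predictions(0.0).std()
std_far = resampled_predictions(3.0).std()
```

In this sketch, `std_far` (at x = 3, well outside the training range) substantially exceeds `std_near` (at x = 0), matching the intuition that predictions far from the training data are less stable under resampling and hence less reliable.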
