How to Evaluate Uncertainty Estimates in Machine Learning for Regression?

As neural networks become more popular, the need for accompanying uncertainty estimates increases. The current testing methodology focuses on how well the predictive uncertainty estimates explain the differences between predictions and observations on a previously unseen test set. Intuitively this is a logical approach, and the current setup of benchmark data sets also allows easy comparison between different methods. We demonstrate, however, through both theoretical arguments and simulations that this way of evaluating the quality of uncertainty estimates has serious flaws. Firstly, it cannot disentangle the aleatoric from the epistemic uncertainty. Secondly, the current methodology considers the uncertainty averaged over all test samples, implicitly averaging out overconfident and underconfident predictions: when checking whether the correct fraction of test points falls inside the prediction intervals, a good average score gives no guarantee that the intervals are sensible for individual points. We demonstrate through practical examples that these effects can lead to favoring, on the basis of predictive uncertainty, a method whose confidence intervals behave undesirably. Finally, we propose a simulation-based testing approach that addresses these problems while still allowing easy comparison between different methods.
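To make the averaging argument concrete, the toy simulation below (our illustration, not an experiment from the paper; the variable names, noise levels, and split point are assumptions) shows how a single, globally calibrated 95% prediction interval can hit the nominal coverage on the test set as a whole while being systematically too wide in one region of the input space and too narrow in another.

```python
# Minimal sketch, assuming a heteroscedastic toy problem with two noise regimes
# and a model that uses one global interval half-width calibrated on residuals.
# All names and numbers are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0.0, 1.0, n)
sigma_true = np.where(x < 0.5, 0.5, 2.0)   # low-noise vs. high-noise regime
y = rng.normal(0.0, sigma_true)            # true regression function is 0 everywhere

# One global interval half-width, calibrated so that 95% of all residuals fall inside
half_width = np.quantile(np.abs(y), 0.95)
inside = np.abs(y) <= half_width

print(f"average coverage:    {inside.mean():.3f}")            # ~0.950 by construction
print(f"low-noise coverage:  {inside[x < 0.5].mean():.3f}")    # ~1.00 (intervals too wide)
print(f"high-noise coverage: {inside[x >= 0.5].mean():.3f}")   # ~0.90 (intervals too narrow)
```

The average coverage metric alone would rate these intervals as well calibrated, even though they are underconfident for half of the inputs and overconfident for the other half.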
