Distilling Ensembles Improves Uncertainty Estimates

We seek to bridge the performance gap between batch ensembles (ensembles of deep networks with shared parameters) and deep ensembles on tasks that require not only predictions but also uncertainty estimates for those predictions. We obtain negative theoretical results on the possibility of approximating deep ensemble weights with batch ensemble weights, and therefore turn to distillation. Training a batch ensemble on the outputs of a deep ensemble improves both accuracy and uncertainty estimates, without requiring hyper-parameter tuning. This result is specific to the batch ensemble architecture: distilling a deep ensemble into a single network is unsuccessful, despite a single network having only marginally fewer parameters than a batch ensemble.
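To make the distillation recipe concrete, below is a minimal sketch (not the authors' released code) of a BatchEnsemble-style linear layer, whose members share a weight matrix modulated by per-member rank-1 factors, together with a distillation loss that trains those members on the deep ensemble's averaged predictive distribution. It assumes a PyTorch setup; the names `BatchEnsembleLinear` and `distill_loss` are illustrative placeholders, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BatchEnsembleLinear(nn.Module):
    """Linear layer with a shared weight W; member i effectively uses
    W * outer(s_i, r_i) (a rank-1 Hadamard factorization, as in BatchEnsemble)."""

    def __init__(self, in_features, out_features, n_members):
        super().__init__()
        self.shared = nn.Linear(in_features, out_features, bias=False)
        self.r = nn.Parameter(torch.ones(n_members, in_features))
        self.s = nn.Parameter(torch.ones(n_members, out_features))
        self.bias = nn.Parameter(torch.zeros(n_members, out_features))

    def forward(self, x):
        # x: (batch, in_features); evaluate every member on the same inputs.
        x = x.unsqueeze(0) * self.r.unsqueeze(1)       # (members, batch, in)
        out = self.shared(x) * self.s.unsqueeze(1)     # (members, batch, out)
        return out + self.bias.unsqueeze(1)


def distill_loss(student_logits, teacher_probs):
    """Cross-entropy of the batch ensemble members against the deep ensemble's
    soft targets; equals KL(teacher || student) up to a constant.

    student_logits: (members, batch, classes) from the batch ensemble.
    teacher_probs:  (batch, classes), mean predictive distribution of the
                    deep ensemble (the distillation target).
    """
    log_q = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs.unsqueeze(0) * log_q).sum(-1).mean()


if __name__ == "__main__":
    layer = BatchEnsembleLinear(16, 10, n_members=4)
    x = torch.randn(8, 16)
    teacher = torch.softmax(torch.randn(8, 10), dim=-1)  # stand-in for deep ensemble outputs
    loss = distill_loss(layer(x), teacher)
    loss.backward()
```

In this sketch each member adds only two vectors and a bias on top of the shared weight, which is why a batch ensemble has only marginally more parameters than a single network while still producing several distinct predictive distributions to match against the teacher.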
