Ensemble Distribution Distillation

Ensembles of models often yield improved system performance. They have also been shown empirically to produce robust measures of uncertainty and to distinguish between different \emph{forms} of uncertainty. However, ensembles come at a computational and memory cost that may be prohibitive for many applications. Significant work has been done on distilling an ensemble into a single model; such approaches reduce computational cost and allow a single model to achieve accuracy comparable to that of an ensemble. However, information about the \emph{diversity} of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of \emph{Ensemble Distribution Distillation} (EnD$^2$): distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD$^2$ enables a single model to retain both the improved classification performance of ensemble distillation and information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD$^2$ based on Prior Networks, a class of models in which a single neural network explicitly models a distribution over output distributions, is proposed in this work. The properties of EnD$^2$ are investigated on an artificial dataset and on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, where it is shown that EnD$^2$ can approach the classification performance of an ensemble and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection.
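To make the idea concrete, the sketch below (PyTorch, not taken from the authors' released code) shows one plausible form of such a distribution-distillation objective: the student Prior Network outputs Dirichlet concentration parameters and is trained to maximise the Dirichlet log-likelihood of the categorical predictions produced by the individual ensemble members. A second helper decomposes the resulting Dirichlet into total and knowledge (mutual-information) uncertainty, the quantity that exploits ensemble diversity for misclassification and out-of-distribution detection. The function names, the `exp` parameterisation of the concentrations and the `eps` smoothing are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def dirichlet_nll(student_logits, ensemble_probs, eps=1e-8):
    """Distribution-distillation loss sketch: negative log-likelihood of the
    ensemble members' categorical predictions under a Dirichlet whose
    concentration parameters are produced by the student (Prior Network).

    student_logits: [batch, classes]           student outputs
    ensemble_probs: [batch, members, classes]  softmax outputs of each ensemble member
    """
    alpha = torch.exp(student_logits)          # concentrations alpha_c > 0 (illustrative parameterisation)
    alpha0 = alpha.sum(dim=-1)                 # Dirichlet precision alpha_0

    # log normalising constant: ln Gamma(alpha_0) - sum_c ln Gamma(alpha_c)
    log_norm = torch.lgamma(alpha0) - torch.lgamma(alpha).sum(dim=-1)                       # [batch]

    # sum_c (alpha_c - 1) * ln pi_c^(m) for every ensemble member m
    log_kernel = ((alpha - 1).unsqueeze(1) * torch.log(ensemble_probs + eps)).sum(dim=-1)   # [batch, members]

    log_prob = log_norm.unsqueeze(1) + log_kernel   # log Dir(pi^(m) | alpha), per member
    return -log_prob.mean()                         # average over members and batch, negate for a loss


def knowledge_uncertainty(student_logits, eps=1e-8):
    """Mutual-information decomposition for a Dirichlet: total uncertainty
    H[E[pi]] minus expected data uncertainty E[H[pi]]. Large values indicate
    disagreement (diversity) captured by the distilled distribution."""
    alpha = torch.exp(student_logits)
    alpha0 = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / alpha0                                                   # expected categorical E[pi]
    total = -(probs * torch.log(probs + eps)).sum(dim=-1)                    # H[E[pi]]
    expected_data = -(probs * (torch.digamma(alpha + 1.0)
                               - torch.digamma(alpha0 + 1.0))).sum(dim=-1)   # E[H[pi]]
    return total - expected_data


# toy check: 32 inputs, 4 classes, an ensemble of 5 members
student_logits = torch.randn(32, 4, requires_grad=True)
ensemble_probs = torch.softmax(torch.randn(32, 5, 4), dim=-1)
dirichlet_nll(student_logits, ensemble_probs).backward()
ku = knowledge_uncertainty(student_logits.detach())
```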
