Greedy Bayesian Posterior Approximation with Deep Ensembles

Ensembles of independently trained neural networks are a state-of-the-art approach to estimating predictive uncertainty in deep learning, and can be interpreted as an approximation of the posterior distribution via a mixture of delta functions. The training of ensembles relies on the non-convexity of the loss landscape and the random initialization of the individual members, leaving the resulting posterior approximation uncontrolled. This paper proposes a novel and principled method to tackle this limitation by minimizing an f-divergence between the true posterior and a kernel density estimator (KDE) in a function space. We analyze this objective from a combinatorial point of view and show that it is submodular with respect to the mixture components for any f. Subsequently, we consider the problem of greedy ensemble construction and, from the marginal gain of the total objective, derive a novel diversity term for ensemble methods. The performance of our approach is demonstrated on computer vision out-of-distribution benchmarks across a range of architectures trained on multiple datasets. The source code of our method is publicly available at https://github.com/MIPT-Oulu/greedy_ensembles_training.
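
For intuition, the abstract combines two ingredients: an f-divergence objective, under the standard convention D_f(p || q) = ∫ q(θ) f(p(θ)/q(θ)) dθ, where the ensemble plays the role of the mixture q, and greedy construction driven by the marginal gain of that objective. The following is only a minimal illustrative sketch of the generic greedy, marginal-gain scheme in Python; the names (greedy_ensemble, objective, candidates) are hypothetical and do not correspond to the authors' released code, which trains each new member with the derived diversity term rather than selecting members from a fixed candidate pool.

    # Illustrative sketch: build a set S by repeatedly adding the candidate
    # with the largest marginal gain F(S + {m}) - F(S). Here `objective` is a
    # stand-in for a (negative) f-divergence surrogate; all names are
    # hypothetical, not the paper's API.
    def greedy_ensemble(candidates, objective, ensemble_size):
        selected = []
        remaining = list(candidates)
        for _ in range(ensemble_size):
            base = objective(selected)
            # Marginal gain of each remaining candidate given the current ensemble.
            gains = [(objective(selected + [m]) - base, i) for i, m in enumerate(remaining)]
            best_gain, best_idx = max(gains)
            selected.append(remaining.pop(best_idx))
        return selected

When the objective is monotone submodular under a cardinality constraint, this greedy rule is the classical baseline with a (1 - 1/e) approximation guarantee, which is why analyzing the marginal gain of the total objective directly yields a diversity term for the member being added.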
