Economical ensembles with hypernetworks

Averaging the predictions of many independently trained neural networks is a simple and effective way to improve generalization in deep learning. However, this strategy quickly becomes costly, because the number of trainable parameters grows linearly with the size of the ensemble. Here, we propose a new method for learning economical ensembles, in which the number of trainable parameters and the number of passes over the data are comparable to those of a single model. Our neural networks are parameterized by hypernetworks, which learn to embed weights in low-dimensional spaces. In a late phase of training, we generate an ensemble by randomly initializing additional weight embeddings in the vicinity of one another. We then exploit the inherent randomness of stochastic gradient descent to induce ensemble diversity. Experiments with wide residual networks on the CIFAR and Fashion-MNIST datasets show that our algorithm yields models that are more accurate and less overconfident on unseen data, while training as efficiently as a single network.
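
To make the two-phase procedure concrete, below is a minimal PyTorch sketch, not the authors' implementation: the hypernetwork and target architectures, the embedding dimension, the perturbation scale used to spawn the extra embeddings, and all names and hyperparameters are illustrative assumptions (the paper trains wide residual networks, not the toy MLP shown here).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperMLP(nn.Module):
    """Maps a low-dimensional embedding z to the weights of a small classifier."""

    def __init__(self, z_dim=16, in_dim=784, hidden=128, out_dim=10):
        super().__init__()
        # Parameter shapes of the target network (two linear layers).
        self.shapes = [(hidden, in_dim), (hidden,), (out_dim, hidden), (out_dim,)]
        n_params = sum(math.prod(s) for s in self.shapes)
        self.hyper = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, n_params)
        )

    def forward(self, x, z):
        flat = self.hyper(z)  # one flat vector holding all target-network weights
        chunks, offset = [], 0
        for shape in self.shapes:
            n = math.prod(shape)
            chunks.append(flat[offset:offset + n].view(shape))
            offset += n
        w1, b1, w2, b2 = chunks
        h = F.relu(F.linear(x, w1, b1))
        return F.linear(h, w2, b2)


def random_batch():
    # Stand-in for a real data loader (e.g. Fashion-MNIST images and labels).
    return torch.randn(32, 784), torch.randint(0, 10, (32,))


z_dim = 16
hnet = HyperMLP(z_dim=z_dim)

# Phase 1: train the hypernetwork with a single weight embedding.
z0 = nn.Parameter(torch.randn(z_dim))
opt = torch.optim.SGD(list(hnet.parameters()) + [z0], lr=0.1)
for step in range(200):
    x, y = random_batch()
    loss = F.cross_entropy(hnet(x, z0), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2 (late training): spawn K additional embeddings in a small
# neighborhood and keep training; each member sees its own mini-batches,
# so SGD noise alone diversifies the ensemble.
K = 5
embeddings = nn.Parameter(z0.detach() + 0.01 * torch.randn(K, z_dim))
opt = torch.optim.SGD(list(hnet.parameters()) + [embeddings], lr=0.1)
for step in range(100):
    for k in range(K):
        x, y = random_batch()
        loss = F.cross_entropy(hnet(x, embeddings[k]), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# At test time, the members' predictive distributions are averaged.
with torch.no_grad():
    x_test = torch.randn(8, 784)
    probs = torch.stack(
        [F.softmax(hnet(x_test, embeddings[k]), dim=-1) for k in range(K)]
    )
    ensemble_prediction = probs.mean(0).argmax(dim=-1)
```

Because only the embeddings differ between members while the hypernetwork is shared, the ensemble adds just K small vectors of trainable parameters rather than K full networks.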
