EnsembleNet: End-to-End Optimization of Multi-headed Models

Ensembling is a universally useful approach to boost the performance of machine learning models. However, individual models in an ensemble are typically trained independently in separate stages, without access to information about the overall ensemble. In this paper, model ensembles are treated as first-class citizens, and their performance is optimized end-to-end with parameter sharing and a novel loss structure that improves generalization. On large-scale datasets including ImageNet, YouTube-8M, and Kinetics, we demonstrate a procedure that starts from a strongly performing single deep neural network and constructs an EnsembleNet that is both smaller and better performing. Moreover, an EnsembleNet can be trained in one stage, just like a single model, without manual intervention.
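
To make the idea of a multi-headed model with shared parameters concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not spell out the loss structure, so the combination used here (a sum of per-head cross-entropy terms plus a weighted cross-entropy on the averaged prediction) is an illustrative assumption, as are the names `ensemble_forward`, `combined_loss`, and `ens_weight` and the toy weights.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    # weights is a list of rows, one row per output unit (no bias, for brevity).
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def cross_entropy(probs, label):
    return -math.log(probs[label])

def ensemble_forward(trunk_w, head_ws, x):
    """Shared trunk followed by several heads.

    Returns the per-head class probabilities and their average,
    which serves as the ensemble prediction in this sketch.
    """
    shared = [max(0.0, h) for h in linear(trunk_w, x)]  # ReLU on shared features
    head_probs = [softmax(linear(w, shared)) for w in head_ws]
    n_heads = len(head_probs)
    n_classes = len(head_probs[0])
    ens_probs = [sum(p[k] for p in head_probs) / n_heads for k in range(n_classes)]
    return head_probs, ens_probs

def combined_loss(head_probs, ens_probs, label, ens_weight=1.0):
    # ASSUMED loss structure (not taken from the paper):
    # per-head terms plus a weighted term on the ensemble prediction.
    head_term = sum(cross_entropy(p, label) for p in head_probs)
    return head_term + ens_weight * cross_entropy(ens_probs, label)

# Toy weights: a trunk mapping 2 inputs to 3 shared features, and two 2-class heads.
trunk_w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
head_ws = [
    [[0.2, 0.1, -0.3], [-0.1, 0.4, 0.2]],
    [[-0.3, 0.2, 0.5], [0.3, -0.2, 0.1]],
]
x = [1.0, 2.0]

head_probs, ens_probs = ensemble_forward(trunk_w, head_ws, x)
loss = combined_loss(head_probs, ens_probs, label=1)
```

Because every head reads the same trunk output, a gradient of `loss` with respect to `trunk_w` would carry signal from all heads and from the ensemble term at once, which is what allows such a model to be trained in a single stage.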
