The relative performance of ensemble methods with deep convolutional neural networks for image classification

Artificial neural networks have been successfully applied to a variety of machine learning tasks, including image recognition, semantic segmentation, and machine translation. However, few studies have fully investigated ensembles of artificial neural networks. In this work, we investigated several widely used ensemble methods, including unweighted averaging, majority voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition tasks, with deep neural networks as candidate algorithms. We designed several experiments in which the candidate algorithms were (i) the same network at different model checkpoints within a single training run, (ii) networks with the same architecture trained multiple times with different stochastic initializations, and (iii) networks with different architectures. In addition, we studied the overconfidence phenomenon of neural networks and its impact on the ensemble methods. Across all of our experiments, the Super Learner achieved the best performance among the ensemble methods considered in this study.
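To make the comparison concrete, the sketch below illustrates how three of these combiners operate on the candidate networks' softmax outputs: unweighted averaging, majority voting, and the discrete Super Learner. This is a minimal NumPy sketch, not the authors' implementation; the array layout and the use of negative log-likelihood as the held-out risk for candidate selection are assumptions made here for illustration.

```python
import numpy as np

def unweighted_average(test_probs):
    """Unweighted averaging: mean of the candidates' softmax outputs.

    test_probs: (n_models, n_samples, n_classes) array of class
    probabilities produced by each candidate network (assumed layout).
    """
    return test_probs.mean(axis=0).argmax(axis=1)

def majority_vote(test_probs):
    """Majority voting: each candidate casts one vote for its argmax
    class; ties are broken toward the lower class index."""
    n_models, n_samples, n_classes = test_probs.shape
    votes = test_probs.argmax(axis=2)  # (n_models, n_samples)
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for m in range(n_models):
        counts[np.arange(n_samples), votes[m]] += 1
    return counts.argmax(axis=1)

def discrete_super_learner(val_probs, val_labels, test_probs):
    """Discrete Super Learner: estimate each candidate's risk on
    held-out data (here, mean negative log-likelihood -- an assumed
    choice of loss), then predict with the single best candidate.

    val_probs:  (n_models, n_val, n_classes) held-out probabilities.
    val_labels: (n_val,) integer class labels for the held-out set.
    test_probs: (n_models, n_test, n_classes) test-set probabilities.
    """
    eps = 1e-12  # guard against log(0) from overconfident candidates
    idx = np.arange(val_labels.shape[0])
    risks = [-np.log(p[idx, val_labels] + eps).mean() for p in val_probs]
    best = int(np.argmin(risks))
    return test_probs[best].argmax(axis=1)
```

A full (non-discrete) Super Learner would instead learn a convex weighting of the candidates by minimizing the cross-validated risk, rather than selecting a single one. Note also that when candidates are overconfident (near one-hot softmax outputs), unweighted averaging degenerates toward majority voting, since each candidate's contribution to the average is dominated by its argmax class.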
