Bootstrapping and multiple imputation ensemble approaches for classification problems

The presence of missing values in a dataset can adversely affect the performance of a classifier. Single and multiple imputation are commonly used to fill in the missing values. In this paper, we present several variants of combining single and multiple imputation with bootstrapping to create ensembles that model uncertainty and diversity in the data and that are robust to high missingness. We present three ensemble strategies: bootstrapping on incomplete data followed by (i) single imputation or (ii) multiple imputation, and (iii) a multiple imputation ensemble without bootstrapping. We extensively evaluate the performance of these ensemble strategies on eight datasets while varying the missingness ratio. Our results show that bootstrapping followed by multiple imputation using expectation maximization is the most robust method, even at high missingness ratios (up to 30%). For small missingness ratios (up to 10%), most of the ensemble methods perform equivalently to one another but better than single imputation. Kappa-error plots suggest that accurate classifiers with reasonable diversity are the reason for this behaviour. Across all datasets, we consistently observe that for small missingness (up to 10%), bootstrapping on incomplete data without any imputation produces results equivalent to those of the other ensemble methods.
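To make strategy (i) concrete, the following is a minimal sketch of bootstrapping on incomplete data followed by single imputation, with the ensemble combined by majority vote. It is an illustration under stated assumptions, not the implementation evaluated in the paper: column-mean imputation and a nearest-centroid base classifier are stand-ins chosen to keep the example dependency-light, and all function names (`mean_impute`, `fit_centroids`, `bootstrap_single_imputation_ensemble`, `predict_majority`) are hypothetical.

```python
import numpy as np


def mean_impute(X, fill=None):
    """Single imputation: replace NaNs with column means.

    If `fill` is given, those values are used instead (e.g. training means
    applied to test data). Mean imputation is an assumed stand-in here.
    """
    X = X.copy()
    if fill is None:
        fill = np.nanmean(X, axis=0)
        # Guard against a column that is entirely missing in this sample.
        fill = np.where(np.isnan(fill), 0.0, fill)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X, fill


def fit_centroids(X, y):
    """Assumed base classifier: one centroid per class present in y."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_centroids(model, X):
    """Assign each row to the class of its nearest centroid."""
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def bootstrap_single_imputation_ensemble(X, y, n_members=25, seed=0):
    """Resample the incomplete data, then impute and train per resample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)      # bootstrap sample with replacement
        Xb, fill = mean_impute(X[idx])        # impute after resampling
        members.append((fit_centroids(Xb, y[idx]), fill))
    return members


def predict_majority(members, X):
    """Combine member predictions by majority vote."""
    votes = np.stack([
        predict_centroids(model, mean_impute(X, fill)[0])
        for model, fill in members
    ])
    out = []
    for col in votes.T:                       # one column of votes per test row
        values, counts = np.unique(col, return_counts=True)
        out.append(values[counts.argmax()])
    return np.array(out)
```

Because imputation happens after resampling, each member sees slightly different fill values, which is one source of the diversity the abstract refers to. Strategy (ii) would replace `mean_impute` with several stochastic imputations per bootstrap sample, and strategy (iii) would drop the resampling step and train one member per imputed copy of the full data.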
