Handling incomplete data classification using imputed feature selected bagging (IFBag) method

Almost all real-world datasets contain missing values. If not handled correctly, missing values can degrade the performance of a classifier. A common approach to classification with incomplete data is imputation, which transforms incomplete data into complete data by filling in the missing values. Single imputation methods are generally less accurate than multiple imputation methods, which in turn are often much more computationally expensive. This study proposes an imputed feature selected bagging (IFBag) method, which combines multiple imputation, feature selection and bagging ensemble learning to construct a number of base classifiers that classify new incomplete instances without any need for imputation in the testing phase. In bagging, the data is resampled multiple times with replacement, which introduces diversity among the training sets and can thus yield more accurate classifiers. The experimental results show that the proposed IFBag method is considerably fast relative to commonly used methods and achieves 97.26% accuracy for classification with incomplete data.
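
The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' implementation. Several pieces are assumptions made purely for illustration: the stand-in imputer (a random draw from the observed values of each feature, repeated per bootstrap sample), the filter that ranks features by the spread of their class means, the nearest-centroid base classifier, and the rule that a base classifier is skipped at test time when one of its selected features is missing. All function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Resample the rows with replacement (same size as the input)."""
    return [rng.choice(rows) for _ in rows]

def impute(rows, rng):
    """Stand-in imputer (assumption): fill each None with a random observed value."""
    n_feat = len(rows[0][0])
    observed = [[x[j] for x, _ in rows if x[j] is not None] for j in range(n_feat)]
    return [([v if v is not None else (rng.choice(observed[j]) if observed[j] else 0.0)
              for j, v in enumerate(x)], y) for x, y in rows]

def class_means(rows, j):
    """Mean of feature j for each class label."""
    sums, counts = {}, Counter()
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x[j]
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def select_features(rows, k):
    """Simple filter (assumption): keep the k features whose class means differ most."""
    def score(j):
        means = class_means(rows, j).values()
        return max(means) - min(means)
    return sorted(range(len(rows[0][0])), key=score, reverse=True)[:k]

def train_ensemble(rows, n_models=15, k=1, seed=0):
    """Each base model: bootstrap -> impute -> select features -> class centroids."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = impute(bootstrap(rows, rng), rng)
        feats = select_features(sample, k)
        centroids = {}
        for j in feats:
            for y, m in class_means(sample, j).items():
                centroids.setdefault(y, []).append(m)
        models.append((feats, centroids))
    return models

def predict(models, x):
    """Majority vote over base models whose selected features are all observed in x."""
    votes = Counter()
    for feats, centroids in models:
        if any(x[j] is None for j in feats):
            continue  # this base model needs a missing feature, so it abstains
        dist = {y: sum((x[j] - c) ** 2 for j, c in zip(feats, cs))
                for y, cs in centroids.items()}
        votes[min(dist, key=dist.get)] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy data: features 0 and 1 separate the classes, feature 2 is noise.
train = []
for i in range(20):
    noise = float(i % 2)
    train.append(([0.1 * (i % 5), 0.1 * (i % 7), noise], 0))
    train.append(([5.0 + 0.1 * (i % 5), 5.0 + 0.1 * (i % 7), noise], 1))
for row, col in [(3, 0), (8, 1), (14, 0), (21, 1)]:
    train[row][0][col] = None  # knock out a few training values

models = train_ensemble(train, n_models=15, k=1, seed=0)
print(predict(models, [5.2, 5.1, None]))  # incomplete test instance, no imputation
```

In this sketch the test instance is never imputed: base models that depend on a missing feature simply abstain, and `predict` returns `None` if every model abstains. The class-mean filter is scale-dependent, so features would need comparable ranges (or a different selection criterion) on real data.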
