An effective and efficient approach to classification with incomplete data

Abstract: Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data must be handled carefully because inadequate treatment of missing values can cause large classification errors. A common approach is to use imputation to transform incomplete data into complete data. However, simple imputation methods are often inaccurate, and powerful imputation methods are usually computationally intensive. A recent approach to handling incomplete data constructs an ensemble of classifiers, each tailored to a known pattern of missing data. The main advantage of this approach is that it can classify new incomplete instances without requiring any imputation. This paper proposes an improvement on the ensemble approach by integrating imputation and genetic-based feature selection. Imputation creates higher-quality training data. Feature selection reduces the number of missing patterns, which speeds up classification and greatly increases the fraction of new instances that the ensemble can classify. Experimental results show that the proposed method is more accurate and faster than previous common methods for classification with incomplete data.
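
The ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-centroid base classifier, the `None` encoding of missing values, the toy data, and all function names (`train_ensemble`, `classify`, `observed_features`) are assumptions made for illustration, and the imputation and genetic search steps are omitted (the selected feature subset is simply passed in by hand). The sketch trains one classifier per pattern of observed features, then classifies a new incomplete instance by majority vote among the classifiers whose features are all present in that instance.

```python
from collections import Counter, defaultdict

MISSING = None  # assumption: missing values are encoded as None


def observed_features(x, features=None):
    """Return the set of feature indices (optionally restricted) present in x."""
    idx = range(len(x)) if features is None else features
    return frozenset(i for i in idx if x[i] is not MISSING)


class NearestCentroid:
    """Tiny stand-in base classifier that only looks at a fixed feature subset."""

    def __init__(self, features):
        self.features = sorted(features)
        self.centroids = {}

    def fit(self, X, y):
        sums = defaultdict(lambda: [0.0] * len(self.features))
        counts = Counter(y)
        for x, label in zip(X, y):
            for k, i in enumerate(self.features):
                sums[label][k] += x[i]
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, x):
        def dist(c):
            return sum((x[i] - m) ** 2
                       for i, m in zip(self.features, self.centroids[c]))
        return min(self.centroids, key=dist)


def train_ensemble(X, y, selected=None):
    """Train one classifier per distinct pattern of observed features."""
    patterns = {observed_features(x, selected) for x in X}
    ensemble = []
    for pattern in patterns:
        if not pattern:
            continue
        # Use every training instance that is complete on this pattern.
        rows = [(x, label) for x, label in zip(X, y)
                if pattern <= observed_features(x, selected)]
        Xs, ys = zip(*rows)
        ensemble.append(NearestCentroid(pattern).fit(Xs, ys))
    return ensemble


def classify(ensemble, x):
    """Majority vote among classifiers whose features are all observed in x."""
    present = observed_features(x)
    votes = [clf.predict(x) for clf in ensemble if set(clf.features) <= present]
    if not votes:
        return None  # no classifier in the ensemble is applicable
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: three features, two classes, some values missing.
X = [
    [1.0, 1.0, None],
    [1.2, None, 0.9],
    [0.9, 1.1, 1.0],
    [5.0, 5.0, None],
    [5.2, None, 5.1],
    [4.9, 5.1, 5.0],
]
y = ["a", "a", "a", "b", "b", "b"]

full = train_ensemble(X, y)                      # patterns over all features
reduced = train_ensemble(X, y, selected=[0, 1])  # patterns over features 0 and 1
```

On this toy data, `full` contains three classifiers and `reduced` contains two, so restricting the feature set lowers the number of missing patterns. An instance such as `[5.1, None, None]` cannot be classified by `full` (no classifier uses only feature 0), but `reduced` includes a classifier on feature 0 alone and returns a label for it. This mirrors, in a simplified way, how fewer features can both shrink the ensemble and widen the set of incomplete instances it covers.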
