Incomplete data ensemble classification using imputation-revision framework with local spatial neighborhood information

Abstract Most existing machine learning techniques require complete data. However, incomplete patterns are common in many real-world scenarios because of missing values (MVs). Various missing value imputation (MVI) methods have been proposed to recover the MVs, and each has its own advantages in particular scenarios. However, few approaches combine the strengths of different MVI methods, and how to improve imputation performance with local information remains an open problem. This paper proposes an imputation-revision framework with local spatial neighborhood information for incomplete data classification. The proposed method aims to combine the advantages of several imputation methods. It first obtains several complete datasets, each pre-filled by a different MVI method. Then, it detects the local neighborhood information (LNI) of the samples and revises the MVs based on the LNI. Finally, an ensemble technique is employed to make the final decision. Numerical experiments verify the superiority of the proposed method in terms of both prediction accuracy and algorithm stability.
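The three stages named in the abstract (pre-fill, neighborhood-based revision, ensemble decision) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which MVI methods, neighborhood definition, or base classifier are used, so everything concrete here is an assumption chosen for brevity: column mean and median as the two pre-filling methods, Euclidean k-nearest neighbors over a sample's observed features as the "local neighborhood", a 1-nearest-neighbor base classifier, and majority voting. All function names (`prefill`, `revise`, `ensemble_predict`, and so on) are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def prefill(X, stat):
    """Fill each missing entry (None) with a column statistic of the observed values."""
    fills = [stat([v for v in col if v is not None]) for col in zip(*X)]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in X]


def sq_dist(a, b, features):
    """Squared Euclidean distance restricted to the given feature indices."""
    return sum((a[j] - b[j]) ** 2 for j in features)


def revise(X, filled, k=2):
    """Re-estimate each originally missing entry from the k nearest pre-filled rows.

    Assumption: neighbors are ranked using only the features that were
    observed in the incomplete sample itself.
    """
    revised = [row[:] for row in filled]
    for i, row in enumerate(X):
        missing = [j for j, v in enumerate(row) if v is None]
        observed = [j for j, v in enumerate(row) if v is not None]
        if not missing:
            continue
        others = [r for idx, r in enumerate(filled) if idx != i]
        neighbours = sorted(others, key=lambda r: sq_dist(r, filled[i], observed))[:k]
        for j in missing:
            revised[i][j] = mean(r[j] for r in neighbours)
    return revised


def predict_1nn(train_X, train_y, query):
    """Label of the single closest training row (stand-in base classifier)."""
    features = range(len(query))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], query, features))
    return train_y[best]


def ensemble_predict(X, y, query, stats=(mean, median), k=2):
    """One pre-filled and revised dataset per imputer, then a majority vote."""
    votes = []
    for stat in stats:
        revised = revise(X, prefill(X, stat), k)
        votes.append(predict_1nn(revised, y, query))
    return Counter(votes).most_common(1)[0][0]


# Toy data: two well-separated classes, one missing value in each.
X = [[1.0, 2.0], [1.2, None], [0.9, 2.1], [5.0, 6.0], [None, 6.2], [5.1, 5.9]]
y = [0, 0, 0, 1, 1, 1]

print(ensemble_predict(X, y, [1.1, 2.0]))
print(ensemble_predict(X, y, [5.0, 6.1]))
```

With only two imputers a vote can tie, in which case this sketch returns the first vote cast. An odd number of pre-filling methods avoids that. Note also that a neighbor's value may itself be a pre-filled estimate, which is one reason a real implementation would need a more careful neighborhood definition than the one assumed here.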
