Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

The treatment of missing data has become increasingly important in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) has been widely used to address this problem. However, previous studies have paid little attention to feature relevance, which has a significant impact on the selection of nearest neighbours. As a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, the feature weighted grey KNN (FWGKNN) imputation algorithm. The approach uses mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms and various missing rates. The experimental results show that feature relevance has a non-negligible influence on missing data estimation based on grey theory, and that our method is superior to the four other estimation strategies compared. Moreover, using our approach can significantly reduce classification bias in classification tasks.
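To make the idea concrete, the following is a minimal sketch of MI-weighted grey relational KNN imputation, not the FWGKNN algorithm as specified in the paper. It rests on several assumptions that the abstract does not state: features are numeric and already scaled to a comparable range, only one column has missing entries, MI is estimated with a simple plug-in estimator after equal-width binning, the distinguishing coefficient is `rho = 0.5`, and the imputed value is a grade-weighted mean of the K neighbours. All function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def discretize(values, bins=2):
    """Equal-width binning (an assumption; any discretizer could be used)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]

def fw_grey_knn_impute(rows, target, k=2, rho=0.5, bins=2):
    """Fill None entries in column `target`; other columns must be complete."""
    complete = [r for r in rows if r[target] is not None]
    feats = [j for j in range(len(rows[0])) if j != target]

    # Feature weights: MI between each feature and the target, normalised.
    t_disc = discretize([r[target] for r in complete], bins)
    mi = [mutual_information(discretize([r[j] for r in complete], bins), t_disc)
          for j in feats]
    total = sum(mi)
    weights = [m / total for m in mi] if total > 0 else [1 / len(feats)] * len(feats)

    imputed = [list(r) for r in rows]
    for r in imputed:
        if r[target] is not None:
            continue
        # Absolute differences between the incomplete row and each complete row.
        diffs = [[abs(r[j] - c[j]) for j in feats] for c in complete]
        dmin = min(min(d) for d in diffs)
        dmax = max(max(d) for d in diffs)
        if dmax == 0:
            grades = [1.0] * len(complete)  # all candidates are identical
        else:
            # Weighted grey relational grade: weighted sum of the
            # per-feature grey relational coefficients.
            grades = [sum(w * (dmin + rho * dmax) / (d + rho * dmax)
                          for w, d in zip(weights, row))
                      for row in diffs]
        top = sorted(range(len(complete)), key=lambda i: grades[i], reverse=True)[:k]
        r[target] = (sum(grades[i] * complete[i][target] for i in top)
                     / sum(grades[i] for i in top))
    return imputed

# Toy data: column 0 tracks the target (column 2); column 1 is unrelated to it.
rows = [
    [0.0, 0.9, 0.0],
    [0.1, 0.1, 0.1],
    [0.9, 0.8, 0.9],
    [1.0, 0.2, 1.0],
    [0.05, 0.8, None],
]
print(fw_grey_knn_impute(rows, target=2)[-1][2])
```

In this toy example the second feature carries no information about the target under the chosen binning, so its weight is zero and the neighbours are chosen by the first feature alone. A larger grade means a closer neighbour, which is why the sort is descending.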
