The Journal of Systems and Software

Existing kNN (k-nearest-neighbor) imputation methods for missing data are built on the Minkowski distance or its variants, and have been shown to be generally efficient for numerical variables (features, or attributes). To deal with heterogeneous (i.e., mixed-attribute) data, we propose a novel kNN imputation method that imputes missing data iteratively, named GkNN (gray kNN) imputation. GkNN selects the k nearest neighbors of each missing datum by calculating the gray distance between the missing datum and all the training data, rather than using a traditional distance metric such as the Euclidean distance. This distance metric can handle both numerical and categorical attributes. To improve effectiveness, GkNN treats all imputed instances (i.e., instances whose missing values have already been imputed) as observed data and uses them, together with the complete instances (instances without missing values), to iteratively impute the remaining missing data. We experimentally evaluate the proposed approach and demonstrate that the gray distance is much better than the Minkowski distance both at capturing the proximity relationship (or nearness) of two instances and at dealing with mixed attributes. Moreover, the experimental results show that the GkNN algorithm is much more efficient than existing kNN imputation methods.
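The abstract does not give the gray distance formula, so the following is only a minimal sketch of the idea, not the authors' implementation. It assumes the standard gray relational coefficient with distinguishing coefficient 0.5, range-normalized absolute differences for numerical attributes, and a 0/1 mismatch for categorical attributes. The names `gray_grades`, `gknn_impute`, and `RHO` are hypothetical, and a single pass that reuses imputed rows stands in for the iterative scheme.

```python
from collections import Counter

RHO = 0.5  # distinguishing coefficient (assumed; a common default in gray relational analysis)


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def gray_grades(query, candidates, ranges):
    """Gray relational grade of each candidate w.r.t. `query` (higher means closer).

    Attributes that are None in `query` are ignored; candidates must be complete.
    """
    idx = [p for p, v in enumerate(query) if v is not None]
    if not idx:  # nothing observed to compare on
        return [1.0] * len(candidates)
    deltas = []
    for cand in candidates:
        row = []
        for p in idx:
            if is_number(query[p]):
                # Numerical attribute: absolute difference scaled by the attribute range
                row.append(abs(query[p] - cand[p]) / ranges[p] if ranges[p] else 0.0)
            else:
                # Categorical attribute: 0 on match, 1 on mismatch (assumption)
                row.append(0.0 if query[p] == cand[p] else 1.0)
        deltas.append(row)
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate matches the query on all observed attributes
        return [1.0] * len(candidates)
    return [
        sum((d_min + RHO * d_max) / (d + RHO * d_max) for d in row) / len(row)
        for row in deltas
    ]


def gknn_impute(rows, k=3):
    """Fill None entries; each imputed row joins the pool of observed rows."""
    pool = [list(r) for r in rows if None not in r]
    # Ranges of numerical attributes, taken from the initially complete rows
    ranges = {}
    for p in range(len(rows[0])):
        col = [r[p] for r in pool]
        if is_number(col[0]):
            ranges[p] = max(col) - min(col)
    result = []
    for r in rows:
        filled = list(r)
        if None in r:
            grades = gray_grades(r, pool, ranges)
            nearest = sorted(range(len(pool)), key=lambda i: grades[i], reverse=True)[:k]
            for p, v in enumerate(r):
                if v is None:
                    vals = [pool[i][p] for i in nearest]
                    if is_number(vals[0]):
                        filled[p] = sum(vals) / len(vals)  # mean of neighbors
                    else:
                        filled[p] = Counter(vals).most_common(1)[0][0]  # mode of neighbors
            pool.append(filled)  # treat the imputed row as observed from now on
        result.append(filled)
    return result
```

For example, `gknn_impute(rows, k=2)` on a small list of mixed-type rows with `None` marking missing entries returns a copy in which every `None` has been replaced by the mean (numerical) or the mode (categorical) of the k rows with the highest gray relational grade.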
