CKNNI: An Improved KNN-Based Missing Value Handling Technique

In the data mining field, experimental data sets are often incomplete because of the imperfect nature of real-world situations, and this incompleteness generally leads to biased outcomes. Handling missing data is therefore one of the most essential challenges in data mining tasks. To achieve better outcomes, many researchers have explored techniques for reducing data incompleteness, and some existing methods are widely used in real-world applications. This paper first discusses some representative existing missing data handling techniques, together with their advantages and drawbacks. It then proposes a new, improved KNN-based algorithm, Class-Based K-clusters Nearest Neighbor Imputation (CKNNI), which integrates the K-means clustering algorithm with the conventional KNN algorithm to impute missing values in data sets. CKNNI first clusters the instances of the same class with the K-means algorithm, then applies the KNN algorithm to select the closest neighbor from the set of centroids of the resulting clusters; missing values are imputed with the values of the corresponding variables in the selected neighbor. Finally, a comparison on multiple data sets indicates that CKNNI significantly improves the performance of KNN imputation on large data sets while remaining comparable to other high-performing missing value handling algorithms.
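
The procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance measure, the number of clusters, or how incomplete rows are treated during clustering. The sketch assumes numeric attributes, Euclidean distance computed over the observed attributes only, clustering on complete rows only, and a hand-written Lloyd's K-means. The names `kmeans` and `cknni_impute` and the parameter `k` are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on complete rows; returns a (k, d) centroid array."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every row to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids

def cknni_impute(X, y, k=2, seed=0):
    """Class-based imputation sketch: NaN marks a missing value (assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    out = X.copy()
    missing = np.isnan(X)
    complete = ~missing.any(axis=1)
    for cls in np.unique(y):
        in_class = y == cls
        train = X[in_class & complete]
        if len(train) == 0:
            continue  # nothing to cluster for this class; leave rows as-is
        centroids = kmeans(train, min(k, len(train)), seed=seed)
        for i in np.where(in_class & ~complete)[0]:
            observed = ~missing[i]
            # 1-nearest-neighbor search among this class's centroids,
            # using only the attributes that are observed in row i.
            d = np.linalg.norm(centroids[:, observed] - X[i, observed], axis=1)
            nearest = centroids[d.argmin()]
            out[i, missing[i]] = nearest[missing[i]]
    return out

# Toy data (invented for illustration): class 0 has two groups, class 1 has one.
X_demo = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
    [9.5, np.nan],                     # class 0 row with a missing value
    [100, 100], [101, 100], [100, 101],
    [np.nan, 100],                     # class 1 row with a missing value
])
y_demo = np.array([0] * 9 + [1] * 4)
X_filled = cknni_impute(X_demo, y_demo, k=2)
```

Because the nearest-neighbor search runs over at most `k` centroids per class rather than over every complete instance, the per-row cost of imputation no longer grows with the size of the data set, which is one plausible reading of why the abstract reports gains on large data sets.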
