Imputing Missing Values for Mixed Numeric and Categorical Attributes Based on Incomplete Data Hierarchical Clustering

Missing data imputation is a key issue in data pre-processing for data mining. Although many methods for missing value imputation exist, almost every one has limitations and is designed for either numeric attributes or categorical attributes, not both. This paper presents IMIC, a new missing value Imputation method for Mixed numeric and categorical attributes based on Incomplete data hierarchical clustering. The method builds on a new concept introduced here, the Incomplete Set Mixed Feature Vector (ISMFV). Its effectiveness is evaluated in comparison experiments on three real data sets from the UCI repository.
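
To make the general idea concrete, the following is a minimal sketch of cluster-based imputation for mixed attributes. It is not the IMIC algorithm and does not implement ISMFV, since the abstract does not specify either. It assumes a plain average-linkage agglomerative clustering, a distance computed only over attributes observed in both records (range-normalized absolute difference for numeric attributes, 0/1 mismatch for categorical ones), and filling by cluster mean or mode. All names (`distance`, `cluster`, `impute`, `MISSING`) are hypothetical.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumed marker for a missing value


def distance(a, b, numeric_idx, ranges):
    """Mean per-attribute distance over attributes observed in both records."""
    total, used = 0.0, 0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is MISSING or y is MISSING:
            continue  # skip attributes missing in either record
        if j in numeric_idx:
            total += abs(x - y) / ranges[j] if ranges[j] else 0.0
        else:
            total += 0.0 if x == y else 1.0
        used += 1
    # Assumption: records sharing no observed attribute are maximally distant.
    return total / used if used else 1.0


def cluster(rows, k, numeric_idx, ranges):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = mean(distance(rows[i], rows[j], numeric_idx, ranges)
                         for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters.pop(q)  # q > p, so index p stays valid
        clusters[p].extend(merged)
    return clusters


def impute(rows, k, numeric_idx):
    """Fill missing values with the cluster mean (numeric) or mode (categorical)."""
    ranges = {}
    for j in numeric_idx:
        vals = [r[j] for r in rows if r[j] is not MISSING]
        ranges[j] = (max(vals) - min(vals)) if vals else 0.0
    result = [list(r) for r in rows]
    for members in cluster(rows, k, numeric_idx, ranges):
        for j in range(len(rows[0])):
            observed = [rows[i][j] for i in members if rows[i][j] is not MISSING]
            if not observed:  # fall back to the whole data set
                observed = [r[j] for r in rows if r[j] is not MISSING]
            if not observed:
                continue  # attribute is missing everywhere; leave it
            if j in numeric_idx:
                fill = mean(observed)
            else:
                fill = Counter(observed).most_common(1)[0][0]
            for i in members:
                if result[i][j] is MISSING:
                    result[i][j] = fill
    return result


if __name__ == "__main__":
    data = [(1.0, "a"), (1.2, None), (None, "a"),
            (9.0, "b"), (None, "b"), (9.4, "b")]
    for row in impute(data, k=2, numeric_idx={0}):
        print(row)
```

The pairwise search makes this sketch cubic in the number of records, so it is suitable only for illustrating the idea on small data.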
