Dealing with Missing Data: Algorithms Based on Fuzzy Set and Rough Set Theories

Missing data, commonly encountered in many fields of study, introduce inaccuracy into analysis and evaluation. Earlier methods for handling missing data (e.g., deleting cases with incomplete information, or substituting missing values with estimated mean scores), though simple to implement, are problematic because they may yield biased data models. Fortunately, recent advances in theoretical and computational statistics have led to more flexible techniques for dealing with the missing data problem. In this paper, we present missing data imputation methods based on clustering, one of the most popular techniques in Knowledge Discovery in Databases (KDD). We combine clustering with soft computing, which tends to be more tolerant of imprecision and uncertainty, and apply fuzzy and rough clustering algorithms to incomplete data. Our experiments show that hybridizing fuzzy set and rough set theories yields the best performance among the four imputation algorithms we compare: crisp K-means, fuzzy K-means, rough K-means, and rough-fuzzy K-means.
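
To make the idea of clustering-based imputation concrete, the following is a minimal sketch of a fuzzy K-means style imputer. It is not the algorithm evaluated in the paper; it is an illustration built on the standard fuzzy c-means membership and centroid updates. The function name `fuzzy_kmeans_impute`, the fuzzifier `m`, the column-mean starting values, and the evenly spaced choice of initial centroids are all assumptions made for this example.

```python
import numpy as np

def fuzzy_kmeans_impute(X, k=2, m=2.0, n_iter=50):
    """Fill NaN cells of X with membership-weighted centroid values.

    Hypothetical helper for illustration only; assumes every column
    has at least one observed value and that m > 1.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Start from column means so that every distance is defined.
    filled = np.where(observed, X, np.nanmean(X, axis=0))
    # Deterministic start: k rows evenly spaced through the data set.
    start = np.linspace(0, len(filled) - 1, k).astype(int)
    centroids = filled[start].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        d2 = ((filled[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # avoid division by zero
        # Fuzzy membership: u_ij proportional to d_ij^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update, weighting each row by u_ij^m.
        W = U ** m
        centroids = (W.T @ filled) / W.sum(axis=0)[:, None]
        # Re-estimate missing cells as a membership-weighted mix of centroids.
        filled = np.where(observed, X, U @ centroids)
    return filled, U

# Two well-separated groups; one value in the first group is missing.
X = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [0.9, np.nan],
    [8.0, 8.0],
    [8.1, 7.9],
    [7.9, 8.2],
])
filled, U = fuzzy_kmeans_impute(X, k=2)
print(filled[2])  # the missing cell is filled from the nearby cluster
```

Replacing the soft memberships in `U` with a hard assignment to the nearest centroid would give the crisp K-means variant; the rough and rough-fuzzy variants named in the abstract additionally use lower and upper cluster approximations, which this sketch does not attempt to reproduce.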
