Ensemble based rough fuzzy clustering for categorical data

Categorical data differs from continuous data in that the values of an attribute do not follow any natural ordering. Moreover, inherent complexities such as uncertainty, vagueness and overlap among clusters make the analysis of real-life categorical data sets more difficult. A review of recent literature shows that well-known categorical data clustering techniques use different similarity/dissimilarity measures to tackle the inherent complexities of categorical attribute values. Generally, it is hard to find a single method and cluster validity measure that can serve as a standard for all kinds of categorical data sets. Hence, in this paper, a clustering method for categorical data is first proposed by fusing rough set and fuzzy set theories. Subsequently, an ensemble-based framework is designed with recently proposed similarity/dissimilarity measures in order to obtain better clustering results for different types of categorical data sets. For this purpose, the proposed rough fuzzy clustering method is applied sequentially, with different measures integrated, to evolve the clustering solutions. Using the consensus of these solutions, pure classified, semi rough and pure rough points are identified. Thereafter, a machine learning method, Random Forest, is used in an incremental way to classify the semi rough and pure rough points using the pure classified points, yielding better clustering results. The performance of the proposed method is demonstrated in comparison with several other recently developed clustering methods. Additionally, the selection of Random Forest in the proposed framework is justified by comparing its performance with other well-known machine learning methods, namely K-Nearest Neighbor and Support Vector Machine. Ten categorical data sets are used in the experiments. Finally, a statistical significance test has been conducted to judge the superiority of the results.
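The consensus step can be illustrated with a minimal sketch. The abstract does not state the exact rule used to separate the three kinds of points, so the thresholds here are an assumption made only for illustration: a point is treated as pure classified when every clustering solution agrees on its label, as semi rough when a strict majority agrees, and as pure rough otherwise. The function name `categorize_points` is hypothetical, and the cluster labels are assumed to be already aligned across solutions.

```python
from collections import Counter


def categorize_points(solutions):
    """Split point indices by how strongly the clustering solutions agree.

    `solutions` is a list of label lists, one per clustering run, all of the
    same length. Labels are assumed to be aligned across runs (an assumption,
    since independent runs generally need a label-matching step first).
    """
    n_runs = len(solutions)
    pure_classified = {}  # index -> label agreed on by every run
    semi_rough = {}       # index -> label agreed on by a strict majority
    pure_rough = []       # indices with no majority label

    # zip(*solutions) yields, for each point, the tuple of labels it received.
    for index, votes in enumerate(zip(*solutions)):
        label, count = Counter(votes).most_common(1)[0]
        if count == n_runs:
            pure_classified[index] = label
        elif count > n_runs / 2:
            semi_rough[index] = label
        else:
            pure_rough.append(index)
    return pure_classified, semi_rough, pure_rough


# Three illustrative solutions over five points (made-up labels).
solutions = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 0],
    [0, 1, 1, 1, 1],
]
pure, semi, rough = categorize_points(solutions)
```

In a pipeline like the one described, the pure classified points would then serve as training data for a classifier such as Random Forest, which would assign labels to the semi rough and pure rough points; that stage is not shown here.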
