Integrated Rough Fuzzy Clustering for Categorical Data Analysis

Abstract In recent times, advanced data mining research has focused largely on clustering categorical data, where attribute values have no natural ordering. To address this, the Rough Fuzzy K-Modes clustering technique was recently developed to handle imperfect information, i.e. indiscernibility (coarseness) and vagueness, within a dataset. However, this technique has been observed to suffer from local optima because the initial cluster modes are chosen randomly. Hence, in this paper, we propose an integrated clustering technique that uses multi-phase learning. First, Simulated Annealing based Rough Fuzzy K-Modes and Genetic Algorithm based Rough Fuzzy K-Modes are proposed, which improve the clustering by treating it as an underlying optimization problem. Each of these clustering methods produces clusters consisting of a set of central points and a set of peripheral points. For each method, the final improved clustering is then obtained by assigning each peripheral point to a particular crisp cluster using Random Forest, with the central points serving as the training set. Second, the varying cardinality of the training and testing sets produced by each clustering method motivated us to propose a generalized technique called Integrated Rough Fuzzy Clustering using Random Forest, in which the results of the three aforementioned clustering techniques are used to compute a roughness measure. Based on this measure, three different sets are determined: best central points, semi-best central points and pure peripheral points. Then, using multi-phase learning, the best central points are used to classify the semi-best central points, and both sets together are used to classify the pure peripheral points with Random Forest.
Experimental results are reported quantitatively and visually to demonstrate the effectiveness of the proposed methods in comparison with well-known state-of-the-art methods on six synthetic and five real-life datasets. Finally, statistical significance tests are conducted to establish the superiority of the results produced by the proposed methods.
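The two-phase labeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free, it swaps Random Forest for a simple nearest-mode classifier that uses the simple matching (Hamming) dissimilarity familiar from K-Modes. It also assumes the three sets (best central, semi-best central, pure peripheral) have already been determined, since the abstract does not define the roughness measure. All function names and the toy data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def fit_modes(rows, labels):
    """Return one mode (per-attribute most frequent value) per cluster label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        for label, members in groups.items()
    }


def predict(modes, row):
    """Assign a row to the cluster whose mode is closest (stand-in for Random Forest)."""
    return min(sorted(modes), key=lambda label: hamming(modes[label], row))


def multi_phase_label(best_rows, best_labels, semi_rows, peripheral_rows):
    # Phase 1: train on best central points, label semi-best central points.
    modes = fit_modes(best_rows, best_labels)
    semi_labels = [predict(modes, r) for r in semi_rows]
    # Phase 2: train on best + semi-best points, label pure peripheral points.
    modes = fit_modes(best_rows + semi_rows, best_labels + semi_labels)
    peripheral_labels = [predict(modes, r) for r in peripheral_rows]
    return semi_labels, peripheral_labels


# Toy categorical data (hypothetical): three attributes, two clusters.
best_rows = [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
             ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r")]
best_labels = [0, 0, 0, 1, 1, 1]
semi_rows = [("a", "x", "r"), ("b", "y", "r")]
peripheral_rows = [("a", "z", "p"), ("c", "z", "s")]

semi_labels, peripheral_labels = multi_phase_label(
    best_rows, best_labels, semi_rows, peripheral_rows
)
```

In a fuller version, the `fit_modes` and `predict` pair would be replaced by a Random Forest trained on an encoded form of the categorical attributes. The two-phase flow would stay the same.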
