Multiobjective Genetic Algorithm-Based Fuzzy Clustering of Categorical Attributes

Clustering categorical data, where the elements of an attribute domain have no natural ordering, has recently been gaining significant attention from researchers. In response to this growing demand, a few clustering algorithms designed for categorical data have been developed. However, most of these methods attempt to optimize a single measure of clustering goodness, and a single measure may not be appropriate for different kinds of datasets. Considering multiple, often conflicting, objectives is therefore natural for this problem. Although we have previously addressed multiobjective fuzzy clustering for continuous data, those algorithms cannot be applied to categorical data, where cluster means are not defined. Motivated by this, this paper proposes a multiobjective genetic algorithm-based approach for fuzzy clustering of categorical data that encodes the cluster modes and simultaneously optimizes the fuzzy compactness and fuzzy separation of the clusters. Moreover, a novel method for obtaining the final clustering solution from the set of resultant Pareto-optimal solutions is proposed. It is based on majority voting among the Pareto front solutions followed by k-NN classification. The performance of the proposed fuzzy categorical data-clustering technique has been compared with that of some other widely used algorithms, both quantitatively and qualitatively, on various synthetic and real-life categorical datasets. A statistical significance test has also been conducted to establish the significant superiority of the proposed multiobjective approach.
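To make the two objectives concrete, the following is a minimal Python sketch of how a candidate set of cluster modes might be scored. It is an illustration under stated assumptions, not the formulation used in the paper: the Hamming distance, the fuzzy k-modes style membership update, the fuzzifier value m = 2, and the function names `hamming`, `memberships`, and `objectives` are all assumed here, and the exact definitions of fuzzy compactness and fuzzy separation in the paper may differ.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def memberships(data, modes, m=2.0):
    """Fuzzy membership of each record in each cluster.

    Assumed update rule (fuzzy k-modes style): membership is proportional
    to distance ** (-1 / (m - 1)), normalised over the clusters.
    """
    u = []
    for x in data:
        d = [hamming(x, z) for z in modes]
        if 0 in d:
            # The record coincides with one or more modes:
            # split full membership evenly among those modes.
            zeros = d.count(0)
            u.append([1.0 / zeros if dk == 0 else 0.0 for dk in d])
        else:
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d]
            total = sum(inv)
            u.append([v / total for v in inv])
    return u


def objectives(data, modes, m=2.0):
    """Return (compactness, separation) for one candidate set of modes.

    Illustrative definitions only:
      compactness = sum of u_ik^m * d(x_i, mode_k), lower is better
      separation  = sum of pairwise distances between modes, higher is better
    """
    u = memberships(data, modes, m)
    compactness = sum(
        (u[i][k] ** m) * hamming(x, z)
        for i, x in enumerate(data)
        for k, z in enumerate(modes)
    )
    separation = sum(hamming(a, b) for a, b in combinations(modes, 2))
    return compactness, separation


if __name__ == "__main__":
    # Toy dataset with two categorical attributes (hypothetical values).
    data = [("a", "x"), ("a", "y"), ("b", "z"), ("c", "z")]
    modes = [("a", "x"), ("b", "z")]
    print(objectives(data, modes))
```

In a multiobjective genetic algorithm, each chromosome would encode one such set of modes, and the two values would be treated as separate objectives (minimize the first, maximize the second) rather than combined into a single score.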
