Incremental learning based multiobjective fuzzy clustering for categorical data

The problem of clustering categorical data, where attribute values cannot be naturally ordered as numerical values, has gained importance in recent years. Due to the special properties of categorical attributes, clustering categorical data is more complicated than clustering numerical data. Although a few clustering algorithms that optimize a single clustering objective have been proposed, it has been found that such a single measure may not be appropriate for all kinds of datasets. Hence, in this article, an Incremental Learning based Multiobjective Fuzzy Clustering technique for Categorical Data is proposed. For this purpose, a multiobjective modified differential evolution based fuzzy clustering algorithm is developed. It is then integrated, using incremental learning, with the well-known supervised classifier random forest to form the proposed technique. Here, the multiobjective algorithm produces a set of optimal clustering solutions, known as Pareto-optimal solutions, by optimizing two conflicting objectives simultaneously. Subsequently, the final solution is evolved from the ensemble of Pareto-optimal solutions through incremental learning with the random forest classifier. The results of the proposed method are demonstrated quantitatively and visually in comparison with widely used state-of-the-art methods on six synthetic and four real-life datasets. Finally, a statistical test is conducted to show the superiority of the results produced by the proposed method.
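
The notion of Pareto-optimal solutions used above can be illustrated with a minimal sketch. The abstract does not name the two objectives or say whether they are minimized or maximized, so the example below assumes both are minimized and uses hypothetical objective values; the function names `dominates` and `pareto_front` are illustrative and are not taken from the article.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (assumption: minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return indices of solutions that no other solution dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (objective_1, objective_2) values for five candidate clusterings.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.0, 3.5)]
front = pareto_front(candidates)
print(front)  # indices of the non-dominated candidates
```

Each surviving candidate represents a different trade-off between the two objectives, which is why a further step (in the article, incremental learning with a random forest) is needed to arrive at a single final solution.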
