Automatic Fuzzy Clustering Using Non-Dominated Sorting Particle Swarm Optimization Algorithm for Categorical Data

Categorical data clustering has attracted considerable attention recently because of its importance in real-world applications. Many clustering methods have been proposed for categorical data. However, most existing algorithms require a predefined number of clusters, which is usually unavailable in real-world problems. Only a few works have focused on automatic clustering, and these mainly handle numerical data. This study develops a novel automatic fuzzy clustering algorithm using non-dominated sorting particle swarm optimization (AFC-NSPSO) for categorical data. The proposed AFC-NSPSO algorithm automatically identifies the optimal number of clusters and produces the clustering result for the selected number of clusters. In addition, a new technique based on local density is investigated to identify the maximum number of clusters in a dataset. Internal validation indices are used to select a final solution from the first Pareto front. Experiments on real-world datasets from the UCI Machine Learning Repository show that the proposed AFC-NSPSO is effective compared with several existing automatic categorical clustering algorithms. This study also applies the proposed algorithm to a real-world case study with an unknown number of clusters.
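
To illustrate what "the first Pareto front" means here, the following minimal Python sketch extracts the non-dominated solutions from a set of candidate clusterings scored on two objectives. It is an illustration only, not the AFC-NSPSO implementation: the abstract does not state the objective functions, so the objective values, the minimization convention, and the names `dominates` and `first_pareto_front` are all assumptions.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b, assuming every objective is minimized.

    a dominates b when a is no worse than b in every objective
    and strictly better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def first_pareto_front(objectives: List[Sequence[float]]) -> List[int]:
    """Return the indices of solutions that no other solution dominates."""
    front = []
    for i, a in enumerate(objectives):
        dominated = any(
            dominates(b, a) for j, b in enumerate(objectives) if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical objective values for five candidate clusterings,
# e.g. (compactness, inverse separation); both are assumed to be minimized.
candidates = [(0.2, 0.9), (0.4, 0.4), (0.9, 0.1), (0.5, 0.5), (0.9, 0.9)]
front = first_pareto_front(candidates)
print(front)  # indices of the non-dominated candidates
```

In a scheme like the one described, a final solution would then be picked from these non-dominated candidates by scoring each with an internal validation index; that selection step is not shown because the abstract does not specify which indices are used.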
