A novel fuzzy clustering algorithm with between-cluster information for categorical data

In this paper, we present a new fuzzy clustering algorithm for categorical data. The algorithm modifies the objective function of the fuzzy k-modes algorithm by adding between-cluster information, so that it simultaneously minimizes the within-cluster dispersion and enhances the between-cluster separation. To obtain locally optimal solutions of the modified objective function, we rigorously derive the corresponding update formulas for the membership matrix and the cluster prototypes. We prove the convergence of the proposed algorithm under the optimization framework. We study its performance on several real data sets from the UCI repository. The experimental results show that the algorithm is effective and well suited to categorical data sets.
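
For orientation, the following is a minimal sketch of the standard fuzzy k-modes iteration that the abstract takes as its starting point, assuming the usual simple matching dissimilarity and a fuzziness exponent `alpha > 1`. It is not the proposed algorithm: the between-cluster term and its update formulas are not stated in the abstract, so they are not reproduced here. All function names and the default `alpha=1.5` are illustrative assumptions.

```python
from collections import defaultdict


def matching_distance(x, z):
    """Simple matching dissimilarity: number of attributes where x and z differ."""
    return sum(1 for a, b in zip(x, z) if a != b)


def update_memberships(data, modes, alpha=1.5):
    """Baseline fuzzy k-modes membership update for fixed modes (assumes alpha > 1)."""
    exponent = 1.0 / (alpha - 1.0)
    memberships = []
    for x in data:
        dists = [matching_distance(x, z) for z in modes]
        if 0 in dists:
            # The object coincides with a mode: give it full membership there
            # (the first such mode, if several modes are identical).
            hit = dists.index(0)
            row = [1.0 if l == hit else 0.0 for l in range(len(modes))]
        else:
            row = [1.0 / sum((d_l / d_h) ** exponent for d_h in dists)
                   for d_l in dists]
        memberships.append(row)
    return memberships


def update_modes(data, memberships, k, alpha=1.5):
    """For each cluster and attribute, pick the category with the largest
    sum of membership**alpha over the objects that take that category."""
    n_attrs = len(data[0])
    modes = []
    for l in range(k):
        mode = []
        for j in range(n_attrs):
            weight = defaultdict(float)
            for x, row in zip(data, memberships):
                weight[x[j]] += row[l] ** alpha
            # sorted() makes tie-breaking deterministic (first category in sort order).
            mode.append(max(sorted(weight), key=weight.get))
        modes.append(tuple(mode))
    return modes


def objective(data, memberships, modes, alpha=1.5):
    """Within-cluster dispersion: sum of membership**alpha times dissimilarity."""
    return sum(row[l] ** alpha * matching_distance(x, z)
               for x, row in zip(data, memberships)
               for l, z in enumerate(modes))


def fuzzy_k_modes(data, init_modes, alpha=1.5, max_iter=100):
    """Alternate the two updates until the modes stop changing."""
    modes = [tuple(z) for z in init_modes]
    for _ in range(max_iter):
        memberships = update_memberships(data, modes, alpha)
        new_modes = update_modes(data, memberships, len(modes), alpha)
        if new_modes == modes:
            break
        modes = new_modes
    # Recompute so the returned memberships match the returned modes.
    return update_memberships(data, modes, alpha), modes
```

This baseline objective only measures within-cluster dispersion. According to the abstract, the proposed method adds a between-cluster term to it, which would change both update steps; that term would have to be taken from the full paper.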
