Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode

In this paper, we present an experimental study on applying a new dissimilarity measure to the k-modes clustering algorithm to improve its clustering accuracy. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies, within the cluster, of the attribute values that the object shares with the mode. Experimental results on real-life datasets show that the modified algorithm is superior to the original k-modes algorithm with respect to clustering accuracy.
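To make the idea concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes one plausible form of the measure: a mismatched attribute contributes 1 to the dissimilarity, and a matched attribute contributes 1 minus the relative frequency of the mode's value in the cluster. The abstract does not state the exact formula, so this form, along with the function names `mode_with_frequencies`, `simple_matching`, and `frequency_weighted`, is an assumption made for illustration.

```python
from collections import Counter


def mode_with_frequencies(cluster):
    """Return (mode, rel_freqs) for a list of equal-length categorical tuples.

    rel_freqs[j] is the fraction of objects in the cluster whose j-th
    attribute equals the j-th value of the mode.
    """
    n = len(cluster)
    mode, rel_freqs = [], []
    for column in zip(*cluster):
        value, count = Counter(column).most_common(1)[0]
        mode.append(value)
        rel_freqs.append(count / n)
    return tuple(mode), tuple(rel_freqs)


def simple_matching(obj, mode):
    """Original k-modes dissimilarity: the number of mismatched attributes."""
    return sum(1 for x, z in zip(obj, mode) if x != z)


def frequency_weighted(obj, mode, rel_freqs):
    """Assumed frequency-aware dissimilarity (illustrative form only).

    A mismatch costs 1; a match costs 1 - relative frequency, so matching
    a value that dominates the cluster counts as stronger similarity.
    """
    return sum(
        1.0 if x != z else 1.0 - f
        for x, z, f in zip(obj, mode, rel_freqs)
    )


# Hypothetical toy cluster with two categorical attributes.
cluster = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "x")]
mode, rel_freqs = mode_with_frequencies(cluster)

obj = ("a", "y")
print(mode, rel_freqs)
print(simple_matching(obj, mode))
print(frequency_weighted(obj, mode, rel_freqs))
```

Under this assumed form, two objects with the same number of mismatches against a mode can still receive different dissimilarities, because matches on values that are common in the cluster reduce the distance more than matches on values that only barely won the majority.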
