Improving categorical data clustering algorithm by weighting uncommon attribute value matches

This paper presents an improved Squeezer algorithm for categorical data clustering that gives greater weight to uncommon attribute value matches in similarity computations. Experimental results on real-life datasets show that the modified algorithm is superior to the original Squeezer algorithm and to other clustering algorithms with respect to clustering accuracy.
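The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a Squeezer-style similarity (for each attribute, the fraction of cluster members that share the record's value) and a hypothetical weight of `1 - f`, where `f` is the value's relative frequency in the whole dataset. The function names `value_frequencies` and `weighted_similarity` are illustrative.

```python
from collections import Counter


def value_frequencies(records):
    """Relative frequency of every value, per attribute, over the whole dataset."""
    n = len(records)
    n_attrs = len(records[0])
    return [
        {v: c / n for v, c in Counter(r[j] for r in records).items()}
        for j in range(n_attrs)
    ]


def weighted_similarity(record, cluster, freqs):
    """Similarity between a record and a cluster (a non-empty list of records).

    Per attribute, the support of the record's value inside the cluster is
    scaled by a weight that grows as the value becomes rarer in the dataset.
    The weight 1 - f is an assumption made for this sketch.
    """
    total = 0.0
    for j, value in enumerate(record):
        support = sum(1 for member in cluster if member[j] == value) / len(cluster)
        weight = 1.0 - freqs[j].get(value, 0.0)  # assumed weighting scheme
        total += weight * support
    return total


# Toy dataset: "a" and "y" are common values, "b" and "x" are uncommon.
records = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "y")]
freqs = value_frequencies(records)
cluster = [("a", "x"), ("b", "y")]

common_match = weighted_similarity(("a", "y"), cluster, freqs)
uncommon_match = weighted_similarity(("b", "x"), cluster, freqs)
```

In this toy example both records match half of the cluster on each attribute, yet the record that matches on the uncommon values receives the higher score, which is the effect the abstract describes.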
