Empirical comparison of fast clustering algorithms for large data sets

Several fast algorithms for clustering very large data sets have been proposed in the literature. CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R$^3$ and GAC-RAR$_w$ exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R$^3$ and GAC-RAR$_w$ can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct or symmetric.
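
As an illustration of the randomized medoid search attributed to CLARANS above, the following is a minimal sketch, not the implementation evaluated in this study. It assumes two-dimensional points, Euclidean distance, and fewer clusters than points; the function names and the parameters `num_local` (number of restarts) and `max_neighbor` (failed swaps tolerated before a solution is treated as a local optimum) are illustrative choices for this sketch.

```python
import math
import random


def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)


def clarans_sketch(points, k, num_local=4, max_neighbor=50, seed=0):
    """Randomized search over sets of k medoids (assumes k < len(points)).

    Returns (sorted medoid indices, total cost) of the best local optimum found.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(points)
    best, best_cost = None, math.inf

    for _ in range(num_local):
        # Start each restart from a random set of k distinct medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)
        failures = 0

        while failures < max_neighbor:
            # A "neighbor" differs from the current solution in one medoid.
            out_pos = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current.copy()
            neighbor[out_pos] = candidate
            neighbor_cost = total_cost(points, neighbor)

            if neighbor_cost < current_cost:
                # Move to the better neighbor and restart the failure count.
                current, current_cost = neighbor, neighbor_cost
                failures = 0
            else:
                failures += 1

        if current_cost < best_cost:
            best, best_cost = current, current_cost

    return sorted(best), best_cost


if __name__ == "__main__":
    # Two well-separated square clusters (made-up data for illustration).
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    medoids, cost = clarans_sketch(data, k=2)
    print(medoids, round(cost, 3))
```

CLARA, by contrast, would run a PAM-style search on a small random sample of the data and then assign the remaining objects to the medoids found, which is why its cost depends mainly on the sample size rather than on the full data size.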
