Paper information - Scalable clustering: a distributed approach

Scalable clustering: a distributed approach

The ever-increasing size of data sets and the poor scalability of clustering algorithms have drawn attention to distributed clustering for partitioning large data sets. In this paper we propose an algorithm that clusters large-scale data sets without clustering all the data at once. The data is randomly divided into disjoint subsets of almost equal size. We then cluster each subset using the hard k-means or fuzzy k-means algorithm. The centroids of the subsets form an ensemble. A centroid correspondence algorithm transitively solves the correspondence problem among the ensemble of centroids, and the corresponding centroids are combined to form a global set of centroids. Experimental results show that, most of the time, the pattern of clusters generated by our algorithm is similar to the pattern generated by clustering all the data at once. We also show that the disputed examples, those assigned differently by our algorithm and by clustering all the data at once, lie on the spatial borders of the clusters.
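The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the correspondence or combination steps in detail, so everything here is an assumption rather than the authors' method: the function names (`split_disjoint`, `align`, `distributed_kmeans`) are hypothetical, correspondence is solved by brute-force search over permutations chained from one subset to the next, matched centroids are combined by an unweighted mean, and only hard k-means is shown.

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, rng, iters=20):
    """Plain hard k-means (Lloyd's algorithm) with random initial centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def split_disjoint(points, n_subsets, rng):
    """Randomly divide the data into disjoint subsets of almost equal size."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def align(reference, centroids):
    """Reorder `centroids` so that centroids[i] corresponds to reference[i].

    Assumption: brute force over all permutations, minimising the total
    squared distance. Only practical for small k.
    """
    best = min(
        itertools.permutations(centroids),
        key=lambda perm: sum(sq_dist(r, c) for r, c in zip(reference, perm)),
    )
    return list(best)


def distributed_kmeans(points, k, n_subsets, seed=0):
    rng = random.Random(seed)
    subsets = split_disjoint(points, n_subsets, rng)
    ensemble = [kmeans(s, k, rng) for s in subsets]
    # Chain the matching: subset i+1 is aligned to the already aligned subset i.
    aligned = [ensemble[0]]
    for centroids in ensemble[1:]:
        aligned.append(align(aligned[-1], centroids))
    # Assumption: combine corresponding centroids with an unweighted mean.
    return [mean([cs[j] for cs in aligned]) for j in range(k)]


# Toy usage: two well-separated 2-D blobs (synthetic data, illustration only).
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(200)]
centers = sorted(distributed_kmeans(data, k=2, n_subsets=4))
print(centers)
```

On this toy data the combined centroids should land near the two blob centres. A size-weighted mean, or an assignment algorithm in place of the permutation search, would be natural alternatives for larger `k`; neither choice is confirmed by the abstract.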

Lawrence O. Hall | Prodip Hore
