A Data-Clustering Algorithm on Distributed Memory Multiprocessors

To cluster the increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data parallelism in the k-means algorithm. We show analytically that the speedup and the scaleup of our algorithm approach the optimum as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups (for example, 15.62 on 16 nodes) and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2-gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops.
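The data parallelism mentioned above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper; it only assumes the common data-parallel formulation in which each processor holds a block of the points, computes per-cluster partial sums and counts locally, and a sum reduction combines them so that every processor arrives at the same new centroids. Processors are simulated with a plain loop, and `all_reduce` is a hypothetical stand-in for a message-passing reduction (for example, an all-reduce in an MPI library). All function names are illustrative.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_step(block, centroids):
    """Work of one simulated processor: partial sums and counts for its block."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for point in block:
        j = nearest(point, centroids)
        counts[j] += 1
        for t in range(d):
            sums[j][t] += point[t]
    return sums, counts


def all_reduce(partials):
    """Hypothetical stand-in for a message-passing sum reduction."""
    k, d = len(partials[0][1]), len(partials[0][0][0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for part_sums, part_counts in partials:
        for j in range(k):
            counts[j] += part_counts[j]
            for t in range(d):
                sums[j][t] += part_sums[j][t]
    return sums, counts


def parallel_kmeans(blocks, centroids, max_iters=100):
    """Iterate until the centroids stop changing or max_iters is reached."""
    centroids = [list(map(float, c)) for c in centroids]
    for _ in range(max_iters):
        # Each block would be processed on its own node in a real run.
        partials = [local_step(block, centroids) for block in blocks]
        sums, counts = all_reduce(partials)
        # An empty cluster keeps its previous centroid (an assumption of this sketch).
        updated = [
            [s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
            for j in range(len(centroids))
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Eight 2-D points split across four simulated processors.
blocks = [
    [(0, 0), (10, 10)],
    [(0, 1), (10, 11)],
    [(1, 0), (11, 10)],
    [(1, 1), (11, 11)],
]
result = parallel_kmeans(blocks, centroids=[(0, 0), (11, 11)])
```

In this formulation only the k partial sums and counts cross processor boundaries in each iteration, while the distance computations, which dominate the cost, stay local to each block. That is one plausible reading of why communication overhead matters less as the number of data points grows.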
