Exploiting parallelism to support scalable hierarchical clustering

A distributed memory parallel version of the group average hierarchical agglomerative clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard Text REtrieval Conference (TREC) test collection, our parallel hierarchical clustering algorithm is shown to be scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the expected O(n²/p) time on p processors rather than the worst-case O(n³/p) time. Furthermore, the O(n²/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies which showed that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations. © 2007 Wiley Periodicals, Inc.
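As a rough illustration of the idea described above, the following Python sketch runs group average agglomerative clustering while simulating p workers in a single process. It is not the authors' implementation: the abstract does not specify the data partitioning or the message passing calls used, so the round-robin row ownership, the cosine similarity measure, the size-weighted (Lance-Williams style) group average update, and the names `cosine` and `group_average_hac` are all assumptions made for this example. Each simulated worker scans only the similarity rows it owns for its best pair, and a global maximum over the per-worker results stands in for a collective reduction.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def group_average_hac(vectors, k, p=2):
    """Merge documents until k clusters remain (hypothetical sketch).

    p is the number of simulated workers; row ownership is round-robin,
    which is an assumption and not taken from the paper.
    """
    n = len(vectors)
    # Upper-triangular similarity matrix keyed by (i, j) with i < j.
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}

    while len(members) > max(k, 1):
        active = sorted(members)
        # Local step: each worker finds the best pair among the rows it owns.
        local_best = []
        for rank in range(p):
            rows = active[rank::p]
            cands = [(sim[(i, j)], i, j)
                     for i in rows for j in active if i < j]
            if cands:
                local_best.append(max(cands))
        # Global step: stands in for a max-reduction across workers.
        _, a, b = max(local_best)
        # Size-weighted average update of similarities to the merged cluster.
        na, nb = len(members[a]), len(members[b])
        for c in active:
            if c in (a, b):
                continue
            key_a = (min(a, c), max(a, c))
            key_b = (min(b, c), max(b, c))
            sim[key_a] = (na * sim[key_a] + nb * sim[key_b]) / (na + nb)
        # Cluster b is absorbed into a; its stale rows are never read again.
        members[a].extend(members.pop(b))

    return sorted(sorted(m) for m in members.values())


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(group_average_hac(docs, k=2, p=2))
```

On the four toy vectors, the first two and the last two documents end up in separate clusters, and the result does not depend on the value of p, since the workers only divide the search for the best pair. In this sketch each worker touches roughly n²/p similarity entries per merge step, which is the intuition behind spreading both time and memory across nodes.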
