Parallel hierarchical clustering on shared memory platforms

Hierarchical clustering has many advantages over traditional clustering algorithms like k-means, but it suffers from higher computational cost and a less obvious parallel structure. To scale this technique to larger datasets, we present SHRINK, a novel shared-memory algorithm for single-linkage hierarchical clustering based on merging the solutions from overlapping sub-problems. In our experiments, SHRINK achieves a speedup of 18–20× on 36 cores on both real and synthetic datasets of up to 250,000 points. Source code for SHRINK is available for download on our website, http://cucis.ece.northwestern.edu.
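The abstract does not spell out SHRINK's individual steps, so the following is only a minimal Python sketch of the general idea of merging solutions from overlapping sub-problems, not the authors' implementation. It relies on the standard fact that the merge heights of a single-linkage dendrogram are the edge weights of a minimum spanning tree (MST) of the complete distance graph. The function names (`kruskal`, `subproblem_mst`, `single_linkage_by_merging`), the round-robin partitioning, and the use of Kruskal's algorithm for both the sub-problems and the merge step are all assumptions made for illustration.

```python
import itertools
import math


def kruskal(n, edges):
    """Return MST (or forest) edges, sorted by weight, over vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def subproblem_mst(points, idx):
    """MST of the complete Euclidean graph restricted to the point indices in idx."""
    edges = [
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(sorted(idx), 2)
    ]
    return kruskal(len(points), edges)


def single_linkage_by_merging(points, parts=2):
    """Single-linkage merge list built from overlapping sub-problems.

    The points are split into `parts` disjoint blocks (hypothetical round-robin
    split). Every pair of blocks forms one sub-problem, so sub-problems overlap
    and every pair of points appears in at least one of them.
    """
    n = len(points)
    if parts < 2:
        return subproblem_mst(points, range(n))
    blocks = [list(range(k, n, parts)) for k in range(parts)]
    candidates = set()
    # Each sub-problem is independent of the others, so this loop is the part
    # that could be distributed across threads or processes.
    for a, b in itertools.combinations(blocks, 2):
        candidates.update(subproblem_mst(points, a + b))
    # Merge step: an MST over the union of sub-problem MST edges has the same
    # edge weights as an MST of the full graph.
    return kruskal(n, candidates)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0), (9.0, 9.0), (0.0, 3.0)]
    for height, i, j in single_linkage_by_merging(pts, parts=3):
        print(f"merge clusters containing {i} and {j} at distance {height:.3f}")
```

The merge step is valid because an edge discarded inside a sub-problem is never lighter than the path that replaces it there, so it is not needed in the global tree. The sketch runs serially; how the sub-problems are actually scheduled on shared memory is left open here.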
