A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures

Hierarchical clustering is a fundamental and widely used clustering algorithm with many advantages over traditional partitional clustering. Due to the explosion in size of modern scientific datasets, there is a pressing need for scalable analytics algorithms, but good scaling is difficult to achieve for hierarchical clustering due to data dependencies inherent in the algorithm. To the best of our knowledge, no previous work on parallel hierarchical clustering has shown scalability beyond a couple of hundred processes. In this paper, we present PINK, a scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a problem instance into two different types of subproblems. Despite the heterogeneous workloads, our algorithm exhibits good load balancing, as well as low memory requirements and a communication pattern that is both low-volume and deterministic. Evaluating PINK on up to 6050 processes, we find that it achieves speedups up to approximately 6600.
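For orientation, the following is a minimal sequential sketch of single-linkage clustering, not the PINK algorithm itself. It relies on the standard fact that the single-linkage merge sequence corresponds to the edges of a minimum spanning tree of the complete distance graph, taken in increasing order of weight. The function name `single_linkage`, the use of Euclidean distance, and the sample points are assumptions made for illustration only.

```python
import math
from itertools import combinations


def single_linkage(points):
    """Return the single-linkage merges as (i, j, distance) tuples.

    Illustrative sequential baseline (Kruskal's algorithm with union-find
    on the complete distance graph); it is not the parallel PINK algorithm.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Find the cluster representative, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise Euclidean distances, sorted by increasing distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # The closest pair of points in different clusters merges them.
            parent[rj] = ri
            merges.append((i, j, d))
            if len(merges) == n - 1:
                break
    return merges


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (5, 0), (5, 3)]
    for i, j, d in single_linkage(sample):
        print(f"merge clusters containing points {i} and {j} at distance {d}")
```

Each merge depends on the clusters formed by all earlier merges, which is one way to see the data dependencies that make the method hard to parallelize. The sketch also materializes all pairwise distances, so its memory use grows quadratically with the number of points.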
