ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward’s linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances that are likely to be reused. Experimentally, we show that our highly optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
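
To illustrate the idea of merging multiple clusters per round, the following is a minimal Python sketch, not the ParChain implementation. It assumes complete linkage computed directly from the points (so no distance matrix is stored), and it simplifies the chain bookkeeping: every cluster recomputes its nearest neighbor in each round, and all reciprocal nearest-neighbor pairs are merged together. The names `complete_linkage` and `rnn_rounds` are hypothetical, and the range query and caching optimizations are omitted.

```python
from itertools import product
from math import dist


def complete_linkage(points, a, b):
    """Largest distance between a point of cluster a and a point of cluster b."""
    return max(dist(points[i], points[j]) for i, j in product(a, b))


def rnn_rounds(points):
    """Merge every reciprocal nearest-neighbor pair of clusters in each round.

    Returns a list of rounds; each round is a list of
    (cluster_id_a, cluster_id_b, linkage_distance) merges.
    Input points get ids 0..n-1; merged clusters get fresh ids n, n+1, ...
    """
    clusters = {i: (i,) for i in range(len(points))}  # id -> member indices
    next_id = len(points)
    rounds = []
    while len(clusters) > 1:
        # Nearest neighbor of each cluster. The iterations are independent,
        # which is what makes this step a candidate for parallel execution.
        # Ties are broken by the smaller cluster id.
        nn = {
            c: min(
                (complete_linkage(points, clusters[c], clusters[o]), o)
                for o in clusters
                if o != c
            )
            for c in clusters
        }
        # A pair is merged when each cluster is the other's nearest neighbor.
        merges = [(c, o, d) for c, (d, o) in nn.items() if c < o and nn[o][1] == c]
        for c, o, _ in merges:
            clusters[next_id] = clusters.pop(c) + clusters.pop(o)
            next_id += 1
        rounds.append(merges)
    return rounds


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
    for r, merges in enumerate(rnn_rounds(pts), start=1):
        print(f"round {r}: {merges}")
```

On the five one-dimensional points in the usage example, the first round merges two pairs at once, which is the behavior the abstract describes. Recomputing linkage distances from the points keeps memory linear in this sketch but is slow; it is meant only to show the round structure.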
