Scalable Bottom-Up Hierarchical Clustering

Bottom-up algorithms, such as classic hierarchical agglomerative clustering, are highly effective for hierarchical as well as flat clustering. However, the large number of rounds and their sequential nature limit the scalability of agglomerative clustering. In this paper, we present an alternative round-based bottom-up hierarchical clustering algorithm, the Sub-Cluster Component Algorithm (SCC), that scales gracefully to massive datasets. Our method builds many sub-clusters in parallel in each round and requires far fewer rounds, usually an order of magnitude fewer than classic agglomerative clustering. Our theoretical analysis shows that, under a modest separability assumption, the hierarchy produced by SCC contains the optimal flat clustering. SCC also provides a 2-approximation to the DP-means objective, thereby introducing a novel application of hierarchical clustering methods. Empirically, SCC finds better hierarchies and flat clusterings even when the data does not satisfy the separability assumption. We demonstrate the scalability of our method by applying it to a dataset of 30 billion points and showing that SCC produces higher-quality clusterings than the state of the art.
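
To make the round-based idea concrete, the following is a minimal Python sketch, not the SCC implementation itself. It assumes details the abstract does not state: centroid distance as the linkage, a hand-picked increasing list of thresholds, and a rule that links each sub-cluster to its nearest neighbor when that neighbor lies within the current threshold, after which connected components become the next round's sub-clusters. The function names (`merge_round`, `round_based_clustering`, `dp_means_cost`) are hypothetical. The DP-means cost used here is the standard one: the sum of squared distances to cluster means plus a penalty `lam` per cluster.

```python
from math import dist


def centroids(points, labels):
    """Map each cluster label to the mean of its member points."""
    groups = {}
    for p, c in zip(points, labels):
        groups.setdefault(c, []).append(p)
    return {c: tuple(sum(x) / len(ms) for x in zip(*ms))
            for c, ms in groups.items()}


def merge_round(points, labels, tau):
    """One round: link every cluster to its nearest other cluster if the
    centroid distance is at most tau, then merge connected components.
    (Centroid linkage and this linking rule are assumptions of the sketch.)"""
    cents = centroids(points, labels)
    parent = {c: c for c in cents}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Each cluster's link depends only on the round-start centroids,
    # so this loop could be evaluated in parallel.
    for c, mu in cents.items():
        others = [(dist(mu, nu), o) for o, nu in cents.items() if o != c]
        if others:
            d, nearest = min(others)
            if d <= tau:
                parent[find(c)] = find(nearest)
    return [find(c) for c in labels]


def round_based_clustering(points, thresholds):
    """Return one flat clustering per round, starting from singletons."""
    labels = list(range(len(points)))
    levels = [labels]
    for tau in thresholds:
        labels = merge_round(points, labels, tau)
        levels.append(labels)
    return levels


def dp_means_cost(points, labels, lam):
    """Sum of squared distances to cluster means plus lam per cluster."""
    cents = centroids(points, labels)
    distortion = sum(dist(p, cents[c]) ** 2 for p, c in zip(points, labels))
    return distortion + lam * len(cents)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
levels = round_based_clustering(points, thresholds=[1.5, 20.0])

# Choose the level of the hierarchy with the lowest DP-means cost.
best = min(levels, key=lambda lab: dp_means_cost(points, lab, lam=1.0))
print(len(set(best)), "clusters selected")
```

On this toy input the hierarchy has three levels (four singletons, two pairs, one cluster), and two rounds suffice because both pairs merge in the same round. Selecting a level by its DP-means cost is one plausible way to read a flat clustering off such a hierarchy; the abstract's approximation guarantee applies to SCC, not to this sketch.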
