Improving Ultrametrics Embeddings Through Coresets

To tackle the curse of dimensionality in data analysis and unsupervised learning, it is critical to be able to efficiently compute “simple” faithful representations of the data that help extract information and improve the understanding and visualization of its structure. When the dataset consists of d-dimensional vectors, such simple representations may consist of trees or ultrametrics, and the goal is to best preserve the distances (i.e., dissimilarity values) between data elements. To circumvent the quadratic running times of the most popular methods for fitting ultrametrics, such as average, single, or complete linkage, Cohen-Addad et al. (2020) recently presented a new algorithm that, for any c ≥ 1, outputs in time n^{1+O(1/c²)} an ultrametric ∆ such that, for any two points u, v, the value ∆(u, v) is within a multiplicative factor of 5c of the distance between u and v in the “best” ultrametric representation. We improve this result: we show how to sharpen the guarantee from 5c to √2c + ε while achieving the same asymptotic running time. To complement the improved theoretical bound, we also show that the performance of our algorithm is significantly better on various real-world datasets.
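To make the setting concrete, here is a minimal sketch (not the paper's subquadratic algorithm) of what fitting an ultrametric means: it uses the classic single-linkage baseline from SciPy, whose cophenetic distances form an ultrametric, and checks the strong triangle inequality ∆(u, v) ≤ max(∆(u, w), ∆(w, v)) that defines one. The data and sizes are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # illustrative: 50 points in 8 dimensions

d = pdist(X)                        # pairwise Euclidean distances (condensed)
Z = linkage(d, method="single")     # O(n^2) baseline, not the paper's method
delta = squareform(cophenet(Z))     # cophenetic (ultrametric) distance matrix

# Check the strong triangle inequality on random triples.
n = len(X)
for _ in range(1000):
    u, v, w = rng.choice(n, size=3, replace=False)
    assert delta[u, v] <= max(delta[u, w], delta[w, v]) + 1e-9

# Distortion relative to the input metric: for single linkage,
# delta(u, v) <= d(u, v) always holds (it is the minimax path cost),
# so the ratio below is >= 1 off the diagonal.
ratio = squareform(d) / np.maximum(delta, 1e-12)
np.fill_diagonal(ratio, 1.0)
print("max d(u,v) / delta(u,v):", ratio.max())

Single linkage is exactly the kind of quadratic-time method the abstract refers to; the paper's contribution is to approximate the best such ultrametric in subquadratic time with distortion √2c + ε instead of 5c.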

[1] Noga Alon et al. Hierarchical Clustering: a 0.585 Revenue Approximation. COLT, 2020.

[2] Guillaume Lagarde et al. On Efficient Low Distortion Ultrametric Embedding. ICML, 2020.

[3] Sampath Kannan et al. A robust model for finding optimal evolutionary trees. Algorithmica, 1993.

[4] Moses Charikar et al. Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics. SODA, 2016.

[5] Daniel Müllner et al. fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, 2013.

[6] Benjamin Moseley et al. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search. NIPS, 2017.

[7] Albert Gu et al. From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering. NeurIPS, 2020.

[8] Kenneth L. Clarkson et al. Smaller core-sets for balls. SODA, 2003.

[9] Sanjoy Dasgupta et al. A cost function for similarity-based hierarchical clustering. STOC, 2015.

[10] Piotr Indyk et al. Approximate clustering via core-sets. STOC, 2002.

[11] Grigory Yaroslavtsev et al. Hierarchical Clustering for Euclidean Data. AISTATS, 2018.

[12] Bernd Gärtner et al. Fast Smallest-Enclosing-Ball Computation in High Dimensions. ESA, 2003.

[13] Gaël Varoquaux et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res., 2011.

[14] Claire Mathieu et al. Hierarchical Clustering. SODA, 2017.

[15] Moses Charikar et al. Hierarchical Clustering better than Average-Linkage. SODA, 2019.

[16] Santosh S. Vempala et al. A discriminative framework for clustering via similarity functions. STOC, 2008.

[17] Joseph S. B. Mitchell et al. Approximate minimum enclosing balls in high dimensions using core-sets. ACM J. Exp. Algorithmics, 2003.

[18] Aurko Roy et al. Hierarchical Clustering via Spreading Metrics. NIPS, 2016.

[19] Varun Kanade et al. Hierarchical Clustering Beyond the Worst-Case. NIPS, 2017.

[20] Sara Ahmadian et al. Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection. AISTATS, 2019.

[21] Michael Cochez et al. Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time. SIGMOD, 2015.

[22] Piotr Indyk et al. Euclidean spanners in high dimensions. SODA, 2013.

[23] Amir Abboud et al. Subquadratic High-Dimensional Hierarchical Clustering. NeurIPS, 2019.

[24] Facundo Mémoli et al. Characterization, Stability and Convergence of Hierarchical Clustering Methods. J. Mach. Learn. Res., 2010.