Improving Ultrametrics Embeddings Through Coresets

To tackle the curse of dimensionality in data analysis and unsupervised learning, it is critical to be able to efficiently compute “simple” faithful representations of the data that help extract information and improve the understanding and visualization of its structure. When the dataset consists of d-dimensional vectors, such simple representations may consist of trees or ultrametrics, and the goal is to best preserve the distances (i.e., dissimilarity values) between data elements. To circumvent the quadratic running times of the most popular methods for fitting ultrametrics, such as average, single, or complete linkage, Cohen-Addad et al. (2020) recently presented a new algorithm that, for any c ≥ 1, outputs in time n^{1+O(1/c²)} an ultrametric ∆ such that, for any two points u, v, the value ∆(u, v) is within a multiplicative factor of 5c of the distance between u and v in the “best” ultrametric representation. We improve this result: we show how to sharpen the guarantee from 5c to √2c + ε while achieving the same asymptotic running time. To complement the improved theoretical bound, we also show that the performance of our algorithm is significantly better on various real-world datasets.
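To make the setting concrete, here is a minimal sketch (not the paper's subquadratic algorithm) of what fitting an ultrametric means: it uses the classic single-linkage baseline from SciPy, whose cophenetic distances form an ultrametric, and checks the strong triangle inequality ∆(u, v) ≤ max(∆(u, w), ∆(w, v)) that defines one. The data and sizes are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # illustrative: 50 points in 8 dimensions

d = pdist(X)                        # pairwise Euclidean distances (condensed)
Z = linkage(d, method="single")     # O(n^2) baseline, not the paper's method
delta = squareform(cophenet(Z))     # cophenetic (ultrametric) distance matrix

# Check the strong triangle inequality on random triples.
n = len(X)
for _ in range(1000):
    u, v, w = rng.choice(n, size=3, replace=False)
    assert delta[u, v] <= max(delta[u, w], delta[w, v]) + 1e-9

# Distortion relative to the input metric: for single linkage,
# delta(u, v) <= d(u, v) always holds (it is the minimax path cost),
# so the ratio below is >= 1 off the diagonal.
ratio = squareform(d) / np.maximum(delta, 1e-12)
np.fill_diagonal(ratio, 1.0)
print("max d(u,v) / delta(u,v):", ratio.max())

Single linkage is exactly the kind of quadratic-time method the abstract refers to; the paper's contribution is to approximate the best such ultrametric in subquadratic time with distortion √2c + ε instead of 5c.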

[1] Noga Alon et al. Hierarchical Clustering: a 0.585 Revenue Approximation. COLT, 2020.

[2] Guillaume Lagarde et al. On Efficient Low Distortion Ultrametric Embedding. ICML, 2020.

[3] Sampath Kannan et al. A robust model for finding optimal evolutionary trees. Algorithmica, 1993.

[4] Moses Charikar et al. Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics. SODA, 2016.

[5] Daniel Müllner et al. fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, 2013.

[6] Benjamin Moseley et al. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search. NIPS, 2017.

[7] Albert Gu et al. From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering. NeurIPS, 2020.

[8] Kenneth L. Clarkson et al. Smaller core-sets for balls. SODA, 2003.

[9] Sanjoy Dasgupta et al. A cost function for similarity-based hierarchical clustering. STOC, 2015.

[10] Piotr Indyk et al. Approximate clustering via core-sets. STOC, 2002.

[11] Grigory Yaroslavtsev et al. Hierarchical Clustering for Euclidean Data. AISTATS, 2018.

[12] Bernd Gärtner et al. Fast Smallest-Enclosing-Ball Computation in High Dimensions. ESA, 2003.

[13] Gaël Varoquaux et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res., 2011.

[14] Claire Mathieu et al. Hierarchical Clustering. SODA, 2017.

[15] Moses Charikar et al. Hierarchical Clustering better than Average-Linkage. SODA, 2019.

[16] Santosh S. Vempala et al. A discriminative framework for clustering via similarity functions. STOC, 2008.

[17] Joseph S. B. Mitchell et al. Approximate minimum enclosing balls in high dimensions using core-sets. ACM J. Exp. Algorithmics, 2003.

[18] Aurko Roy et al. Hierarchical Clustering via Spreading Metrics. NIPS, 2016.

[19] Varun Kanade et al. Hierarchical Clustering Beyond the Worst-Case. NIPS, 2017.

[20] Sara Ahmadian et al. Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection. AISTATS, 2019.

[21] Michael Cochez et al. Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time. SIGMOD, 2015.

[22] Piotr Indyk et al. Euclidean spanners in high dimensions. SODA, 2013.

[23] Amir Abboud et al. Subquadratic High-Dimensional Hierarchical Clustering. NeurIPS, 2019.

[24] Facundo Mémoli et al. Characterization, Stability and Convergence of Hierarchical Clustering Methods. J. Mach. Learn. Res., 2010.