Improved Error Bounds for Tree Representations of Metric Spaces

Estimating optimal phylogenetic trees or hierarchical clustering trees from metric data is an important problem in evolutionary biology and data analysis. Intuitively, the goodness-of-fit of a metric space to a tree depends on its inherent treeness, as well as other metric properties such as intrinsic dimension. Existing algorithms for embedding metric spaces into tree metrics provide distortion bounds depending on cardinality. Because cardinality is a simple property of any set, we argue that such bounds do not fully capture the rich structure endowed by the metric. We consider an embedding of a metric space into a tree proposed by Gromov. By proving a stability result, we obtain an improved additive distortion bound depending only on the hyperbolicity and doubling dimension of the metric. We observe that Gromov's method is dual to the well-known single linkage hierarchical clustering (SLHC) method. By means of this duality, we are able to transport our results to the setting of SLHC, where such additive distortion bounds were previously unknown.
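
The abstract does not spell out the construction, so the following Python sketch illustrates the duality under the standard formulation, which is assumed here. Fix a basepoint p and take the Gromov product g_p(x, y) = (d(x, p) + d(y, p) - d(x, y)) / 2. Gromov's tree metric replaces g_p by its max-min value over chains from x to y. The single linkage ultrametric is the min-max value of d over the same chains. All function names and the two toy metrics are hypothetical, chosen only for this illustration.

```python
# Illustrative sketch only: the function names and toy metrics below are
# assumptions made for this example, not code from the paper.

def gromov_products(d, p):
    """Gromov product g_p(x, y) = (d(x,p) + d(y,p) - d(x,y)) / 2 for all pairs."""
    n = len(d)
    return [[0.5 * (d[x][p] + d[y][p] - d[x][y]) for y in range(n)]
            for x in range(n)]


def chain_closure(m, outer, inner):
    """Optimise a matrix over chains x = x_0, ..., x_k = y (Floyd-Warshall style).

    outer=max, inner=min scores a chain by its smallest link and keeps the best
    chain (max-min); outer=min, inner=max is the dual min-max version.
    """
    n = len(m)
    c = [row[:] for row in m]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = outer(c[i][j], inner(c[i][k], c[k][j]))
    return c


def gromov_tree_metric(d, p=0):
    """Tree (pseudo)metric t(x, y) = d(x,p) + d(y,p) - 2 * maxmin g_p(x, y)."""
    n = len(d)
    g = chain_closure(gromov_products(d, p), max, min)
    return [[d[x][p] + d[y][p] - 2.0 * g[x][y] for y in range(n)]
            for x in range(n)]


def single_linkage_ultrametric(d):
    """Single linkage ultrametric u(x, y) = minmax of d over chains from x to y."""
    return chain_closure(d, min, max)


# Toy metric 1: four points on a line, which is already a tree metric.
positions = [0, 1, 3, 6]
line = [[abs(a - b) for b in positions] for a in positions]

# Toy metric 2: shortest-path metric on a 4-cycle with unit edges (not a tree).
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

if __name__ == "__main__":
    print(gromov_tree_metric(line, p=0))
    print(gromov_tree_metric(cycle, p=0))
    print(single_linkage_ultrametric(cycle))
```

On the line metric the construction should reproduce the input distances, since that metric is already a tree metric. On the 4-cycle the tree metric never exceeds the original distance, and the gap between the two is the additive distortion that the abstract bounds in terms of hyperbolicity and doubling dimension. The cubic-time closure is chosen for readability; a minimum spanning tree computation would be the usual faster route to the same min-max values.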
