Fitting tree metrics: Hierarchical clustering and phylogeny

Given dissimilarity data on pairs of objects in a set, we study the problem of fitting a tree metric to this data so as to minimize additive error (i.e., some measure of the difference between the tree metric and the given data). This problem arises in constructing an M-level hierarchical clustering of objects (or an ultrametric on objects) so as to match the given dissimilarity data, a basic problem in statistics. Viewed in this way, the problem is a generalization of the correlation clustering problem (which corresponds to M = 1). We give a very simple randomized combinatorial algorithm for the M-level hierarchical clustering problem that achieves an approximation ratio of M+2. This is a generalization of a previous factor 3 algorithm for correlation clustering on complete graphs. The problem of fitting tree metrics also arises in phylogeny, where the objective is to learn the evolutionary tree by fitting a tree to dissimilarity data on taxa. The quality of the fit is measured by taking the $\ell_p$ norm of the difference between the tree metric constructed and the given data. Previous results obtained a factor 3 approximation for finding the closest tree metric under the $\ell_\infty$ norm. No nontrivial approximation for general $\ell_p$ norms was known before. We present a novel LP formulation for this problem and use it to obtain an $O((\log n \log\log n)^{1/p})$ approximation. En route, we obtain an $O((\log n \log\log n)^{1/p})$ approximation for the closest ultrametric under the $\ell_p$ norm. Our techniques are based on representing and viewing an ultrametric as a hierarchy of clusterings, and may be useful in other contexts.
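To make the objective concrete, the following is a minimal Python sketch of how a hierarchy of clusterings induces an ultrametric and how the $\ell_p$ error of a fit could be computed. It is an illustration only, not the algorithm or the LP from the paper. The function names (`hierarchy_to_ultrametric`, `is_ultrametric`, `lp_error`), the toy data, and the distance convention (the distance between two objects is the number of levels at which they lie in different clusters) are all assumptions introduced here.

```python
from itertools import combinations


def hierarchy_to_ultrametric(levels):
    """Turn nested clusterings into pairwise distances.

    `levels` is a list of dicts mapping object -> cluster label, finest
    clustering first. The levels are assumed to be nested (each level only
    merges clusters of the previous one). Assumed convention: d(u, v) is
    the number of levels at which u and v are in different clusters.
    """
    objects = sorted(levels[0])
    return {
        (u, v): sum(1 for labels in levels if labels[u] != labels[v])
        for u, v in combinations(objects, 2)
    }


def is_ultrametric(dist, objects):
    """Check the three-point condition: in every triple, the two largest
    pairwise distances are equal."""
    def d(u, v):
        return dist[(u, v)] if (u, v) in dist else dist[(v, u)]

    for u, v, w in combinations(sorted(objects), 3):
        _, middle, largest = sorted([d(u, v), d(v, w), d(u, w)])
        if middle != largest:
            return False
    return True


def lp_error(dist, data, p):
    """l_p norm of the difference between fitted distances and the data."""
    diffs = [abs(dist[pair] - data[pair]) for pair in dist]
    if p == float("inf"):
        return max(diffs)
    return sum(x ** p for x in diffs) ** (1.0 / p)


# Hypothetical toy instance with M = 2 levels on four objects.
levels = [
    {"a": 0, "b": 0, "c": 1, "d": 2},  # finest level: {a, b}, {c}, {d}
    {"a": 0, "b": 0, "c": 1, "d": 1},  # coarser level: {a, b}, {c, d}
]
fitted = hierarchy_to_ultrametric(levels)

# Made-up dissimilarity data; it differs from `fitted` on (a, b) and (a, d).
data = {
    ("a", "b"): 1, ("a", "c"): 2, ("a", "d"): 1,
    ("b", "c"): 2, ("b", "d"): 2, ("c", "d"): 1,
}

print(is_ultrametric(fitted, levels[0].keys()))
for p in (1, 2, float("inf")):
    print(p, lp_error(fitted, data, p))
```

With a single level (M = 1) and 0/1 dissimilarities, the $\ell_1$ error under this convention is just the number of pairs on which the clustering disagrees with the data, which is the correlation clustering objective on complete graphs.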
