Combinatorics of distance-based tree inference

Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or objects of any kind). The objective is to construct a tree with branch lengths such that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, then in some relevant cases (e.g., ordinary least squares) the optimal values for the branch lengths can be expressed using simple combinatorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties. First, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency). Second, the branch lengths of all the simple (nearest neighbor interchange) rearrangements of a tree can be calculated, optimally, in time quadratic in the size of the tree, thus allowing the efficient application of hill-climbing heuristics. The study presented here is a continuation of that by Mihaescu and Pachter on branch length estimation [Mihaescu R, Pachter L (2008) Proc Natl Acad Sci USA 105:13206–13211]. The focus here is on the inference of the tree itself and on providing a basis for novel algorithms to reconstruct trees from distances.
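As an illustration of what such combinatorial formulae look like, the following minimal sketch treats only the smallest nontrivial case, a tree on four taxa with topology ab|cd. It is not taken from the paper: the function names and the dictionary layout for the distances are assumptions made for this example. For a quartet, the ordinary least-squares branch lengths reduce to fixed linear combinations of the six input distances, and minimum evolution then selects the topology whose fitted tree has the smallest total length.

```python
def quartet_branch_lengths(dist, a, b, c, d):
    """Closed-form least-squares branch lengths for the quartet topology ab|cd.

    `dist` maps frozenset({x, y}) to the input distance between x and y
    (a hypothetical layout chosen for this sketch).
    """
    def D(x, y):
        return dist[frozenset((x, y))]

    def pendant(x, sib, far1, far2):
        # Leaf x with sibling `sib`; far1 and far2 sit across the internal edge.
        return 0.5 * D(x, sib) + 0.25 * (
            D(x, far1) + D(x, far2) - D(sib, far1) - D(sib, far2)
        )

    internal = 0.25 * (D(a, c) + D(a, d) + D(b, c) + D(b, d)) - 0.5 * (
        D(a, b) + D(c, d)
    )
    return {
        a: pendant(a, b, c, d),
        b: pendant(b, a, c, d),
        c: pendant(c, d, a, b),
        d: pendant(d, c, a, b),
        "internal": internal,
    }


def minimum_evolution_quartet(dist, w, x, y, z):
    """Return (split, total_length) for the shortest of the three quartet topologies."""
    candidates = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    scored = []
    for (p, q), (r, s) in candidates:
        lengths = quartet_branch_lengths(dist, p, q, r, s)
        scored.append((sum(lengths.values()), ((p, q), (r, s))))
    total, split = min(scored)
    return split, total


# Distances generated by a tree ab|cd with pendant lengths 1, 2, 3, 4
# and an internal branch of length 0.5 (illustrative values only).
example_dist = {
    frozenset(("a", "b")): 3.0,
    frozenset(("a", "c")): 4.5,
    frozenset(("a", "d")): 5.5,
    frozenset(("b", "c")): 5.5,
    frozenset(("b", "d")): 6.5,
    frozenset(("c", "d")): 7.0,
}

example_lengths = quartet_branch_lengths(example_dist, "a", "b", "c", "d")
example_split, example_total = minimum_evolution_quartet(
    example_dist, "a", "b", "c", "d"
)
```

On these exactly tree-like distances, the formulae return the generating branch lengths and minimum evolution picks the split ab|cd, which is the behavior that consistency describes in the noise-free limit. Larger trees need the general formulae of the paper, which this sketch does not attempt to reproduce.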
