Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle

The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be statistically consistent when it is used in conjunction with ordinary least-squares (OLS) fitting of a metric to a tree structure. The traditional approach to using ME has been to start with the Neighbor Joining (NJ) topology for a given matrix and then do a topological search from that starting point. The first stage requires O(n^3) time, where n is the number of taxa, while current implementations of the second stage run in O(p n^3) time or more, where p is the number of swaps performed by the program. In this paper, we examine a greedy approach to minimum evolution which produces a starting topology in O(n^2) time. Moreover, we provide an algorithm that searches for the best topology using nearest neighbor interchanges (NNIs), where the cost of doing p NNIs is O(n^2 + p n), i.e., O(n^2) in practice because p is always much smaller than n. The Greedy Minimum Evolution (GME) algorithm, when used in combination with NNIs, produces trees which are fairly close to NJ trees in terms of topological accuracy. We also examine ME under a balanced weighting scheme, where sibling subtrees have equal weight, as opposed to the standard "unweighted" OLS, where all taxa have the same weight so that the weight of a subtree is equal to the number of its taxa. The balanced minimum evolution scheme (BME) runs slower than the OLS version, requiring O(n^2 × diam(T)) operations to build the starting tree and O(p n × diam(T)) to perform the NNIs, where diam(T) is the topological diameter of the output tree. Under the usual Yule-Harding distribution on phylogenetic trees, the expected diameter is of order log(n), so our algorithms are in practice faster than NJ. Moreover, this BME scheme yields a very significant improvement over NJ and other distance-based algorithms, especially with large trees, in terms of topological accuracy.
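
The difference between the two weighting schemes can be illustrated with a minimal Python sketch of the average distance between two subtrees. This is only an illustration of the weighting idea stated above, not the paper's GME or BME implementation; the nested-tuple tree encoding, the function names, and the toy distances are all assumptions made for the example.

```python
from itertools import product

# Assumed encoding (hypothetical): a leaf is a string,
# an internal node is a pair (left_subtree, right_subtree).

def leaves(subtree):
    """Return the leaf labels under a nested-tuple subtree."""
    if isinstance(subtree, str):
        return [subtree]
    left, right = subtree
    return leaves(left) + leaves(right)

def dist(d, x, y):
    """Look up the pairwise distance between two taxa."""
    return d[frozenset((x, y))]

def unweighted_avg(d, a, b):
    """OLS-style average: every taxon has the same weight, so a
    subtree counts in proportion to its number of taxa."""
    la, lb = leaves(a), leaves(b)
    total = sum(dist(d, x, y) for x, y in product(la, lb))
    return total / (len(la) * len(lb))

def balanced_avg(d, a, b):
    """Balanced average: the two sibling subtrees of any internal
    node get equal weight, regardless of their sizes."""
    if not isinstance(a, str):
        return 0.5 * (balanced_avg(d, a[0], b) + balanced_avg(d, a[1], b))
    if not isinstance(b, str):
        return 0.5 * (balanced_avg(d, a, b[0]) + balanced_avg(d, a, b[1]))
    return dist(d, a, b)

# Toy data (made up for illustration): subtree ((a, b), c) versus leaf x.
toy = {
    frozenset(("a", "x")): 2.0,
    frozenset(("b", "x")): 4.0,
    frozenset(("c", "x")): 9.0,
}
subtree = (("a", "b"), "c")

print(unweighted_avg(toy, subtree, "x"))  # (2 + 4 + 9) / 3 = 5.0
print(balanced_avg(toy, subtree, "x"))    # 0.5 * (0.5 * (2 + 4) + 9) = 6.0
```

In the unweighted case the cherry (a, b) contributes two thirds of the average because it holds two of the three taxa; in the balanced case it contributes one half, the same as its sibling c.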
