Obtaining highly accurate topology estimates of evolutionary trees from very short sequences

The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree in which internal nodes represent ancestral species and the leaves represent modern-day species. Accurate (or even boundedly inaccurate) topology reconstruction of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8% of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and on some trees achieves acceptable error rates (less than 5% error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10K nucleotides to achieve the same level of accuracy.
1 Introduction

The evolution of biomolecular sequences can be modeled as a Markov process operating on a rooted binary tree: a biomolecular sequence at the root of the tree "evolves down" the tree, with each edge of the tree introducing point mutations, thereby generating sequences at the leaves of the tree, each of the same length as the root sequence. The phylogenetic tree reconstruction problem is to take the sequences that occur at the leaves of the tree, and infer, as accurately as possible, the tree that generated the sequences. The tree reconstruction problem has two objectives: first, to recover the branching process, as represented by the rooted leaf-labelled topology of the evolutionary tree, and second, to estimate the parameters of the evolutionary process (the mutation probabilities on the edges of the tree, the rates of change across …
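
The generative process described in the introduction (a root sequence that evolves down a rooted binary tree, with each edge introducing point mutations) can be illustrated with a minimal Python sketch. This is not the simulation setup used in the paper: it assumes a simplified symmetric substitution model in which each site changes to one of the other three nucleotides with a single per-edge probability, whereas the paper analyzes the general Markov model. The tree encoding, the function names (`mutate`, `evolve`, `simulate`) and the edge probabilities in `MODEL_TREE` are all hypothetical choices made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(seq, p, rng):
    """Return a copy of seq in which each site is independently replaced,
    with probability p, by a different nucleotide chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def evolve(tree, seq, rng):
    """Evolve seq down tree and return {leaf_name: leaf_sequence}.

    A tree is either a leaf name (str) or a tuple of
    (child_subtree, edge_mutation_probability) pairs.
    """
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child, p in tree:
        # Each edge applies its own point mutations to the parent's sequence.
        leaves.update(evolve(child, mutate(seq, p, rng), rng))
    return leaves


def simulate(tree, length, seed=0):
    """Draw a uniform random root sequence and evolve it down the tree."""
    rng = random.Random(seed)
    root = "".join(rng.choice(NUCLEOTIDES) for _ in range(length))
    return evolve(tree, root, rng)


# Hypothetical four-leaf model tree ((a,b),(c,d)) with assumed edge probabilities.
MODEL_TREE = (
    ((("a", 0.05), ("b", 0.05)), 0.02),
    ((("c", 0.05), ("d", 0.05)), 0.02),
)

leaf_sequences = simulate(MODEL_TREE, length=20)
for name, sequence in sorted(leaf_sequences.items()):
    print(name, sequence)
```

Every leaf sequence has the same length as the root sequence, as in the model described above. A reconstruction method would receive only `leaf_sequences` and would try to recover the topology of `MODEL_TREE` from them.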
