A few logs suffice to build (almost) all trees (I)

Peter L. Erdos
Mathematical Institute of the Hungarian Academy of Sciences, Budapest, P.O. Box 127, Hungary-1364
E-mail: elp@math-inst.hu

Michael A. Steel
Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
E-mail: m.steel@math.canterbury.ac.nz

Laszlo A. Szekely
Department of Mathematics, University of South Carolina, Columbia, South Carolina
E-mail: laszlo@math.sc.edu

Tandy J. Warnow
Department of Computer and Information Science, University of Pennsylvania, Philadelphia PA
E-mail: tandy@central.cis.upenn.edu

A phylogenetic tree (also called an "evolutionary tree") is a leaf-labelled tree that represents the evolutionary history of a set of species, and the construction of such trees is a fundamental problem in biology. Here we address the question of how many sequence sites are required to recover the tree with high probability when the sites evolve under standard Markov-style i.i.d. mutation models. We provide analytic upper and lower bounds for the required sequence length by developing a new, polynomial-time algorithm. In particular, we show that when the mutation probabilities are bounded, the required sequence length can grow surprisingly slowly (as a power of log n) in the number n of sequences, for almost all trees.
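To make the question of required sequence length concrete, the following Python sketch estimates how often a four-leaf tree is recovered from simulated sequences of a given length. It is an illustration under stated assumptions, not the algorithm developed in the paper: it assumes a two-state symmetric mutation model (often called the Cavender-Farris model), the same mutation probability `p` on every edge, and a simple four-point comparison of corrected pairwise distances. All function names and parameter values are hypothetical.

```python
import itertools
import math
import random

TRUE_SPLIT = (("a", "b"), ("c", "d"))  # the assumed model tree is ab|cd


def simulate_quartet(k, p, rng):
    """Simulate k i.i.d. binary sites on the quartet tree ab|cd.

    Each of the five edges flips the state independently with
    probability p (an assumption made for this sketch).
    """
    seqs = {leaf: [] for leaf in "abcd"}
    for _ in range(k):
        u = rng.randrange(2)              # state at the internal node next to a, b
        v = u ^ (rng.random() < p)        # state at the internal node next to c, d
        seqs["a"].append(u ^ (rng.random() < p))
        seqs["b"].append(u ^ (rng.random() < p))
        seqs["c"].append(v ^ (rng.random() < p))
        seqs["d"].append(v ^ (rng.random() < p))
    return seqs


def corrected_distance(x, y):
    """Additive distance estimate for the two-state symmetric model."""
    h = sum(s != t for s, t in zip(x, y)) / len(x)  # fraction of differing sites
    if h >= 0.5:
        return math.inf                   # saturated: the correction is undefined
    return -0.5 * math.log(1.0 - 2.0 * h)


def four_point_split(dist):
    """Return the pairing of leaves with the smallest sum of distances.

    dist maps frozenset({x, y}) to a distance; ties go to the first pairing.
    """
    pairings = [
        (("a", "b"), ("c", "d")),
        (("a", "c"), ("b", "d")),
        (("a", "d"), ("b", "c")),
    ]
    return min(pairings,
               key=lambda pr: dist[frozenset(pr[0])] + dist[frozenset(pr[1])])


def success_rate(k, p, trials, seed=0):
    """Fraction of simulated data sets from which ab|cd is recovered."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seqs = simulate_quartet(k, p, rng)
        dist = {frozenset((x, y)): corrected_distance(seqs[x], seqs[y])
                for x, y in itertools.combinations("abcd", 2)}
        hits += four_point_split(dist) == TRUE_SPLIT
    return hits / trials


if __name__ == "__main__":
    for k in (10, 50, 250, 1000):
        print(k, success_rate(k, p=0.1, trials=200))
```

Running the script prints an estimated recovery rate for each sequence length; in this toy setting the rate should rise towards 1 as `k` grows. The sketch only covers a single quartet and says nothing about how the required length scales with the number n of sequences, which is the subject of the bounds above.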
