The number of nucleotide sites needed to accurately reconstruct large evolutionary trees

Biologists seek to reconstruct evolutionary trees for an increasing number of species, $n$, from aligned genetic sequences. How fast must the sequence length $N$ grow, as a function of $n$, in order to accurately recover the underlying tree with probability $1-\epsilon$, if the sequences evolve according to simple stochastic models of nucleotide substitution? We show that for a certain model, a reconstruction method exists for which the sequence length $N$ can grow surprisingly slowly with $n$ (sublinearly for a wide range of parameters, and even as a power of $\log n$ in a narrow range, which roughly meets the lower bound from information theory). By contrast, a more traditional technique (maximum compatibility) provably requires $N$ to grow faster than linearly in $n$. Our approach is based on a new and computationally efficient method for reconstructing phylogenetic trees from aligned DNA sequences.
