Phylogenetic information complexity: is testing a tree easier than finding it?

Phylogenetic trees describe the evolutionary history of a group of present-day species descended from a common ancestor. These trees are typically reconstructed from aligned DNA sequence data. In this paper we analytically address the following question: is the amount of sequence data required to accurately reconstruct a tree significantly greater than the amount required to test whether a candidate tree is the 'true' tree? By 'significantly', we mean that the two quantities do not behave the same way as a function of the number of species being considered. We prove that, for a certain type of model, the amounts of information required are not significantly different, whereas for another type of model, the information required to test a tree is independent of the number of leaves, while that required to reconstruct it grows with this number. Our results combine probabilistic and combinatorial arguments.
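One way to read the two claims above is in terms of growth rates. The notation here is an assumption introduced only for illustration and does not come from the paper: $n$ is the number of leaves (species), and $k_{\mathrm{rec}}(n)$ and $k_{\mathrm{test}}(n)$ are the amounts of sequence data needed to reconstruct the tree and to test a candidate tree, respectively. Reading "not significantly different" as agreement up to constant factors is likewise an interpretation, not a statement of the paper's theorems.

```latex
% Illustrative notation only (not taken from the paper):
%   n             = number of leaves (species)
%   k_rec(n)      = data needed to reconstruct the tree
%   k_test(n)     = data needed to test a candidate tree
%
% First type of model: the two quantities behave the same way in n
% (interpreted here, as an assumption, as equality up to constant factors).
\[
  k_{\mathrm{rec}}(n) = \Theta\bigl(k_{\mathrm{test}}(n)\bigr)
\]
% Second type of model: testing needs an amount independent of n,
% while reconstruction needs an amount that grows with n.
\[
  k_{\mathrm{test}}(n) = O(1),
  \qquad
  k_{\mathrm{rec}}(n) \to \infty \quad \text{as } n \to \infty .
\]
```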
