Phase transition in the sample complexity of likelihood-based phylogeny inference

Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics, and that connection has led to important insights into the theoretical properties of phylogenetic reconstruction algorithms as well as to the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique that is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample complexity or sequence-length requirement of maximum likelihood, that is, the amount of data required for correct reconstruction, are exponential in the number, n, of tips, far from known lower bounds based on information-theoretic arguments. Here we close the gap by proving a new upper bound on the sequence-length requirement of maximum likelihood that matches, up to constants, the known lower bound for some standard models of evolution. More specifically, for the r-state symmetric model of sequence evolution on a binary phylogeny with bounded edge lengths, we show that the sequence-length requirement behaves logarithmically in n when the expected amount of mutation per edge is below what is known as the Kesten-Stigum threshold. In general, the sequence-length requirement is polynomial in n. Our results imply moreover that the maximum likelihood estimator can be computed efficiently on randomly generated data provided the sequence length is as above.
Our main technical contribution, which may be of independent interest, relates the total variation distance between the leaf state distributions of two trees to a notion of combinatorial distance between the trees. In words, we show in a precise quantitative manner that the more different two evolutionary trees are, the easier it is to distinguish their outputs.
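
As a toy illustration of the quantities named above, the following minimal sketch computes, by brute-force enumeration, the total variation distance between the single-site leaf state distributions of two four-leaf trees under an r-state symmetric model. It is not the paper's argument and does not reproduce its bounds. Everything in it is an assumption made for illustration: the function names, the parameterization of each edge by a single value theta (the probability that a child copies its parent, otherwise drawing a uniform state), the use of the same theta on all five edges, and the statement of the Kesten-Stigum condition on a binary tree as 2 * theta^2 > 1.

```python
from itertools import product


def transition(r, theta):
    """r-state symmetric channel (illustrative parameterization):
    with probability theta copy the parent state, otherwise draw a uniform state."""
    off = (1.0 - theta) / r
    return [[theta + off if i == j else off for j in range(r)] for i in range(r)]


def below_kesten_stigum(theta, branching=2):
    """True when branching * theta**2 > 1, i.e. the per-edge mutation is
    low enough to sit below the Kesten-Stigum threshold (assumed form)."""
    return branching * theta ** 2 > 1.0


def quartet_distribution(r, theta, cherry):
    """Exact single-site leaf distribution of a four-leaf tree.

    cherry = (a, b) are the two leaves attached to internal node u; the
    remaining two leaves attach to internal node v; u and v share an edge.
    The root is placed at u with a uniform state, which is harmless here
    because the model is reversible with uniform stationary distribution.
    """
    P = transition(r, theta)
    a, b = cherry
    c, d = [leaf for leaf in range(4) if leaf not in cherry]
    dist = {}
    for leaves in product(range(r), repeat=4):
        total = 0.0
        for u in range(r):
            for v in range(r):
                total += (
                    (1.0 / r) * P[u][v]
                    * P[u][leaves[a]] * P[u][leaves[b]]
                    * P[v][leaves[c]] * P[v][leaves[d]]
                )
        dist[leaves] = total
    return dist


def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


if __name__ == "__main__":
    r, theta = 4, 0.8
    p = quartet_distribution(r, theta, (0, 1))  # topology 01|23
    q = quartet_distribution(r, theta, (0, 2))  # topology 02|13
    print("below Kesten-Stigum:", below_kesten_stigum(theta))
    print("total variation distance:", total_variation(p, q))
```

In this toy setting the distance vanishes at theta = 0, where the leaves are independent and uniform under both trees, and is strictly positive at intermediate values of theta. With only four leaves all pairs of distinct topologies are equally far apart, so the sketch shows only the distinguishability of two trees, not the dependence on combinatorial distance.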
