Coalescent-based species tree estimation: a stochastic Farris transform

The reconstruction of a species phylogeny from genomic data faces two significant hurdles: (1) the trees describing the evolution of each individual gene (the gene trees) may differ from the species phylogeny, and (2) the molecular sequences corresponding to each gene often provide limited information about the gene trees themselves. In this paper we consider an approach to species tree reconstruction that addresses both of these hurdles. Specifically, we propose an algorithm for phylogeny reconstruction under the multispecies coalescent model with a standard model of site substitution. The multispecies coalescent is commonly used to model gene tree discordance due to incomplete lineage sorting, a well-studied population-genetic effect. In previous work, an information-theoretic trade-off was derived in this context between the number of loci, $m$, needed for an accurate reconstruction and the length of the locus sequences, $k$. It was shown that to reconstruct an internal branch of length $f$, one needs $m$ to be of the order of $1/[f^{2} \sqrt{k}]$. That result was obtained under the molecular clock assumption, i.e., under the assumption that mutation rates (as well as population sizes) are constant across the species phylogeny. Here we generalize this result beyond the restrictive molecular clock assumption and obtain a new reconstruction algorithm that has the same data requirement (up to log factors). Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with $n \geq 3$ species, the rooted species tree can be identified from the distribution of its unrooted weighted gene trees, even in the absence of a molecular clock.
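
To make the trade-off between $m$ and $k$ concrete, the following minimal sketch evaluates the scaling $m \sim 1/[f^{2} \sqrt{k}]$ for a few sequence lengths. The function name `loci_needed` and the constant `c` are hypothetical and introduced only for illustration: the abstract gives the order of magnitude, not the leading constant or the log factors, so the numbers are meaningful only relative to one another.

```python
import math


def loci_needed(f: float, k: int, c: float = 1.0) -> float:
    """Order-of-magnitude loci count m ~ c / (f^2 * sqrt(k)).

    f: length of the internal branch to be reconstructed
    k: sequence length of each locus
    c: unknown leading constant (assumed 1.0 here, purely illustrative)
    """
    if f <= 0 or k <= 0:
        raise ValueError("f and k must be positive")
    return c / (f ** 2 * math.sqrt(k))


if __name__ == "__main__":
    f = 0.01  # a short internal branch (illustrative value)
    for k in (100, 400, 1600):
        print(f"k = {k:5d}  ->  m ~ {loci_needed(f, k):.0f} (up to constants)")
```

Under this scaling, quadrupling the sequence length $k$ only halves the number of loci required, whereas halving the branch length $f$ quadruples it.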