An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees

Background: Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. However, few studies have attempted to use analytical theory to quantify “sufficiently large”.

Results: Using the multispecies coalescent model, we report a general analytical upper bound on the number of gene trees n required such that with probability q, each bipartition of a species tree is represented at least once in a set of n random gene trees. This bound employs a formula that is straightforward to compute, depends only on the minimum internal branch length of the species tree and the number of taxa, and applies irrespective of the species tree topology. Using simulations, we investigate numerical properties of the bound as well as its accuracy under the multispecies coalescent.

Conclusions: Our results are helpful for conservatively bounding the number of gene trees required by the ASTRAL inference method, and the approach has potential to be extended to bound other properties of gene tree sets under the model.
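The abstract does not reproduce the bound itself, so the following is only a generic illustration of how a bound of this kind can be computed, not the paper's formula. It assumes a hypothetical input `p_min`, a lower bound on the probability that any single species-tree bipartition appears in one random gene tree (in the paper this role is played by a quantity derived from the minimum internal branch length, which is not given here). An unrooted binary tree on k taxa has k - 3 non-trivial bipartitions, so a union bound gives P(some bipartition is missing from n independent gene trees) ≤ (k - 3)(1 - p_min)^n, and solving for n yields a conservative sample size.

```python
import math


def loci_upper_bound(p_min: float, num_taxa: int, q: float) -> int:
    """Smallest n with (num_taxa - 3) * (1 - p_min)**n <= 1 - q.

    p_min    -- assumed lower bound on the per-gene-tree probability that a
                given species-tree bipartition appears (hypothetical input,
                not the paper's formula)
    num_taxa -- number of taxa k; a binary tree has k - 3 non-trivial splits
    q        -- target probability that every split appears at least once
    """
    if not 0.0 < p_min <= 1.0:
        raise ValueError("p_min must be in (0, 1]")
    if not 0.0 < q < 1.0:
        raise ValueError("q must be in (0, 1)")
    if num_taxa < 4:
        raise ValueError("need at least 4 taxa for a non-trivial bipartition")

    num_splits = num_taxa - 3
    if p_min == 1.0:
        # Every gene tree displays every split, so one locus suffices.
        return 1

    # Union bound: num_splits * (1 - p_min)**n <= 1 - q, solved for n.
    n = math.log((1.0 - q) / num_splits) / math.log(1.0 - p_min)
    return max(1, math.ceil(n))


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    print(loci_upper_bound(p_min=0.1, num_taxa=10, q=0.99))
```

Because the union bound ignores dependence among bipartitions within a gene tree, a value computed this way is conservative in the same spirit as the bound described in the abstract: it may overstate the number of loci actually needed.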
