Robustness to divergence time underestimation when inferring species trees from estimated gene trees.

To infer species trees from gene trees estimated from phylogenomic data sets, tractable methods are needed that can handle dozens to hundreds of loci. We examine several computationally efficient approaches (MP-EST, STAR, STEAC, STELLS, and STEM) for inferring species trees from gene trees estimated using maximum likelihood (ML) and Bayesian approaches. Among the methods examined, we found that topology-based methods often performed better using ML gene trees, whereas methods employing coalescence times typically performed better using Bayesian gene trees, with MP-EST, STAR, STEAC, and STELLS outperforming STEM under most conditions. We examine why the STEM tree (also called GLASS or Maximum Tree) is less accurate on estimated gene trees by comparing estimated and true coalescence times, performing species tree inference using simulations, and analyzing a great ape data set while tracking false positive and false negative rates for inferred clades. We find that although true coalescence times are more ancient than speciation times under the multispecies coalescent model, estimated coalescence times are often more recent than speciation times. When gene trees are estimated with ML, this underestimation can lead to increased bias and lack of resolution as sampling (of either alleles or loci) increases. The problem appears to be less severe using Bayesian gene-tree estimates.
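
The effect of underestimated coalescence times on a minimum-based estimator can be illustrated with a minimal sketch. It assumes, as a simplification, that the divergence time for one species pair is estimated by the smallest coalescence time for that pair across loci. The speciation time TAU_AB, the toy coalescence times, and the helper names are hypothetical values chosen for illustration; they are not data or code from the study.

```python
def min_coalescence_time(times):
    """Smallest pairwise coalescence time across loci (a minimum-based estimate)."""
    return min(times)


def running_minimum(times):
    """Minimum-based estimate after each additional locus is sampled."""
    result = []
    current = float("inf")
    for t in times:
        current = min(current, t)
        result.append(current)
    return result


# Hypothetical speciation time for species A and B (arbitrary units).
TAU_AB = 1.0

# Hypothetical true coalescence times at three loci: under the multispecies
# coalescent, lineages from A and B cannot coalesce before the speciation
# time, so every true value is at least TAU_AB.
true_times = [1.4, 1.1, 2.3]

# Hypothetical estimated times for the same loci: estimation error places
# one of them below TAU_AB.
estimated_times = [1.5, 0.7, 2.0]

true_min = min_coalescence_time(true_times)
est_min = min_coalescence_time(estimated_times)

print("minimum of true times:", true_min, "(>= tau:", true_min >= TAU_AB, ")")
print("minimum of estimated times:", est_min, "(>= tau:", est_min >= TAU_AB, ")")
print("running minimum over loci:", running_minimum(estimated_times))
```

Because a minimum can only stay the same or decrease as loci are added, a single underestimated coalescence time is never corrected by further sampling in this sketch, which is consistent with the bias growing as more alleles or loci are sampled.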