Computational Performance and Statistical Accuracy of *BEAST and Comparisons with Other Methods

Under the multispecies coalescent model of molecular evolution, gene trees have independent evolutionary histories within a shared species tree. In contrast, supermatrix concatenation methods assume that all gene trees share a single genealogical history, thereby equating gene coalescence with species divergence. The multispecies coalescent is supported by previous studies, which found that its predicted distributions fit empirical data and that concatenation is not a consistent estimator of the species tree. *BEAST, a fully Bayesian implementation of the multispecies coalescent, is popular but computationally intensive, so the increasing size of phylogenetic data sets is both a computational challenge and an opportunity for better systematics. Using simulation studies, we characterize the scaling behavior of *BEAST and enable quantitative prediction of how increasing the number of loci affects both computational performance and statistical accuracy. Follow-up simulations over a wide range of parameters show that the statistical performance of *BEAST relative to concatenation improves both as branch length is reduced and as the number of loci is increased. Finally, using simulations based on parameters estimated from two phylogenomic data sets, we compare the performance of a range of species tree and concatenation methods and show that using *BEAST with tens of loci can be preferable to using concatenation with thousands of loci. Our results provide insight into the practicalities of Bayesian species tree estimation, the number of loci required to obtain a given level of accuracy, and the situations in which supermatrix or summary methods will be outperformed by the fully Bayesian multispecies coalescent.
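The link between short branches and gene tree discordance can be illustrated with the standard three-species result for the multispecies coalescent. The sketch below is not taken from the simulations described above; it assumes one sampled lineage per species, a constant population size, no gene flow, and an internal branch length measured in coalescent units. The function names `discordance_probability` and `simulate_discordance` are hypothetical. Under those assumptions, the probability that a gene tree topology differs from the species tree is (2/3)·exp(−T), where T is the internal branch length.

```python
import math
import random


def discordance_probability(internal_branch: float) -> float:
    """Closed-form probability that a three-taxon gene tree topology
    differs from the species tree ((A,B),C).

    internal_branch is the length of the branch ancestral to A and B,
    in coalescent units (an assumption of this sketch).
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    return (2.0 / 3.0) * math.exp(-internal_branch)


def simulate_discordance(internal_branch: float, n_loci: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability over n_loci independent loci."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_loci):
        # The A and B lineages coalesce at rate 1 within the internal branch.
        waiting_time = rng.expovariate(1.0)
        if waiting_time > internal_branch:
            # No coalescence before the root: three lineages enter the root
            # population and the first pair to coalesce is chosen uniformly.
            # Two of the three possible pairs give a discordant topology.
            if rng.random() < 2.0 / 3.0:
                discordant += 1
    return discordant / n_loci


if __name__ == "__main__":
    for branch in (0.1, 0.5, 1.0, 2.0):
        exact = discordance_probability(branch)
        estimate = simulate_discordance(branch, n_loci=20000)
        print(f"T={branch:.1f}  exact={exact:.3f}  simulated={estimate:.3f}")
```

As the internal branch shrinks toward zero, the discordance probability approaches 2/3, which is the regime where treating all loci as sharing one genealogy is least justified.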
