StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates

The multispecies coalescent (MSC) reconstructs species trees from a set of genes, and fully Bayesian MSC methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. One alternative is concatenation, which assumes that the evolutionary history of each gene is identical to the species tree. Concatenation is an inconsistent estimator of species tree topology and a worse estimator of divergence times. It also induces spurious substitution rate variation when incomplete lineage sorting is present. Another alternative is to use summary MSC methods like ASTRAL, but such methods are also unsatisfactory because they infer branch lengths in coalescent units and so cannot estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators, and other optimizations. Together these improve computational performance by 13.1× to 13.8× when analyzing empirical data sets, and by an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUti package manager in BEAST 2.4 and above.
