A biologist’s guide to Bayesian phylogenetic analysis

Bayesian methods have become very popular in molecular phylogenetics due to the availability of user-friendly software for running sophisticated models of evolution. However, Bayesian phylogenetic models are complex, and analyses are often carried out using default settings, which may not be appropriate. Here we summarize the major features of Bayesian phylogenetic inference and discuss Bayesian computation using Markov chain Monte Carlo (MCMC) sampling, the diagnosis of an MCMC run, and ways of summarizing the MCMC sample. We discuss the specification of the prior, the choice of the substitution model, and the partitioning of the data. Finally, we provide a list of common Bayesian phylogenetic software packages and recommend appropriate applications.

Bayesian phylogenetic methods are very popular among evolutionary biologists and ecologists. This Review summarizes the major features of Bayesian inference and discusses several practical aspects of Bayesian computation.
