Bayesian Selection of Nucleotide Substitution Models and Their Site Assignments

Probabilistic inference of a phylogenetic tree from molecular sequence data is predicated on a substitution model describing the relative rates of change between character states along the tree for each site in the multiple sequence alignment. Commonly, one assumes that the substitution model is homogeneous across sites within large partitions of the alignment, assigns these partitions a priori, and then fixes their underlying substitution model to the best-fitting model from a hierarchy of named models. Here, we introduce an automatic model selection and model averaging approach within a Bayesian framework that simultaneously estimates the number of partitions, the assignment of sites to partitions, the substitution model for each partition, and the uncertainty in these selections. This new approach is implemented as an add-on to the BEAST 2 software platform. We find that this approach dramatically improves the fit of the nucleotide substitution model compared with existing approaches, and we show, using a number of example data sets, that as many as nine partitions are required to explain the heterogeneity in the nucleotide substitution process across sites in a single gene analysis. In some instances, this improved modeling of the substitution process can have a measurable effect on downstream inference, including the estimated phylogeny, relative divergence times, and effective population size histories.
