Markov-modulated continuous-time Markov chains to identify site- and branch-specific evolutionary variation

Abstract. Markov models of character substitution on phylogenies form the foundation of phylogenetic inference frameworks. Early models made the simplifying assumption that the substitution process is homogeneous over time and across sites in the molecular sequence alignment. While standard practice adopts extensions that accommodate heterogeneity of substitution rates across sites, site-specific heterogeneity in the process over time remains frequently overlooked. This is problematic because evolutionary processes acting at the molecular level are highly variable: different sites are subject to different selective constraints over time, which changes their substitution behavior. We propose incorporating time variability through Markov-modulated models (MMMs), which extend covarion-like models and allow the substitution process at individual sites (both the relative character exchange rates and the overall substitution rate) to vary across lineages. We implement a general MMM framework in BEAST, a popular Bayesian phylogenetic inference software package, allowing researchers to compose a wide range of MMMs through flexible XML specification. Using examples from bacterial, viral, and plastid genome evolution, we show that MMMs impact phylogenetic tree estimation and can substantially improve model fit compared to standard substitution models. Through simulations, we show that marginal likelihood estimation accurately identifies the generative model and does not systematically prefer the more parameter-rich MMMs. To mitigate the increased computational demands associated with MMMs, our implementation exploits recent developments in BEAGLE, a high-performance computational library for phylogenetic inference. [Bayesian inference; BEAGLE; BEAST; covarion; heterotachy; Markov-modulated models; phylogenetics.]
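
To make the idea of a Markov-modulated substitution process concrete, the following is a minimal sketch of how such a generator can be assembled. It is not the BEAST implementation. It assumes the standard Markov-modulated Markov chain construction, in which each site carries a hidden regime alongside its observed character, each regime has its own substitution rate matrix, and the regime switches without changing the character. The function names (`jc69`, `mmm_generator`), the choice of Jukes-Cantor regimes, and all numerical rates are hypothetical and chosen only for illustration.

```python
def jc69(rate):
    """4x4 Jukes-Cantor generator whose total leaving rate per state is `rate`."""
    off = rate / 3.0
    return [[-rate if i == j else off for j in range(4)] for i in range(4)]


def mmm_generator(regime_qs, switch):
    """Assemble a Markov-modulated generator on composite states (regime, character).

    regime_qs: list of K generators, each n x n (one substitution model per regime).
    switch:    K x K matrix; switch[k][l] is the rate of moving from regime k to l
               (diagonal entries are ignored).
    Composite state (k, i) is stored at index k * n + i.
    """
    k_count = len(regime_qs)
    n = len(regime_qs[0])
    size = k_count * n
    q = [[0.0] * size for _ in range(size)]
    for k in range(k_count):
        for i in range(n):
            row = k * n + i
            # Character changes within the current regime.
            for j in range(n):
                if j != i:
                    q[row][k * n + j] = regime_qs[k][i][j]
            # Regime changes that leave the character untouched.
            for l in range(k_count):
                if l != k:
                    q[row][l * n + i] = switch[k][l]
            # Diagonal makes the row sum to zero (it is still 0.0 at this point).
            q[row][row] = -sum(q[row])
    return q


# Illustrative (assumed) setup: a slow and a fast regime with symmetric switching.
slow, fast = jc69(0.1), jc69(2.0)
switch = [[0.0, 0.5],
          [0.5, 0.0]]
q = mmm_generator([slow, fast], switch)
```

With two nucleotide regimes the composite generator is 8 x 8. A covarion-like model is the special case in which one regime has all substitution rates set to zero; letting the regimes differ in their exchange rates as well as their overall rate gives the more general behavior described above.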
