Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models

Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to the influence of the genetic code at short timescales and the dominance of the physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model that operates on codons but is observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than with amino acid substitution models.
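The aggregation effect described above can be illustrated with a toy example. The sketch below is not the model used in the paper: it assumes a hypothetical discrete-time chain with three hidden states, two of which (standing in for synonymous codons) are observed under the same label "A". All transition probabilities and names are invented for illustration. It compares the true two-step transition probabilities between observed labels with those obtained by squaring the one-step observed matrix, which is what a Markov model fitted at the observed level would predict.

```python
# Toy aggregated Markov process: hidden states 0 and 1 are both observed as "A",
# hidden state 2 is observed as "B". All numbers are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def stationary(p, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi

def aggregate(p_t, pi, labels, classes):
    """Transition probabilities between observed classes, assuming the
    hidden chain starts in its stationary distribution."""
    agg = []
    for a in classes:
        src = [i for i, lab in enumerate(labels) if lab == a]
        weight = sum(pi[i] for i in src)
        row = []
        for b in classes:
            dst = [j for j, lab in enumerate(labels) if lab == b]
            row.append(sum(pi[i] * p_t[i][j] for i in src for j in dst) / weight)
        agg.append(row)
    return agg

# Hidden (codon-like) one-step transition matrix; each row sums to 1.
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
LABELS = ["A", "A", "B"]   # observed (amino-acid-like) label of each hidden state
CLASSES = ["A", "B"]

PI = stationary(P)
agg_one = aggregate(P, PI, LABELS, CLASSES)                   # observed, 1 step
agg_two_true = aggregate(matmul(P, P), PI, LABELS, CLASSES)   # observed, 2 steps
agg_two_markov = matmul(agg_one, agg_one)                     # Markov prediction

# Largest discrepancy between the true and the Markov-predicted two-step matrix.
gap = max(abs(agg_two_true[i][j] - agg_two_markov[i][j])
          for i in range(len(CLASSES)) for j in range(len(CLASSES)))

print("true two-step:  ", agg_two_true)
print("Markov two-step:", agg_two_markov)
print("max discrepancy:", gap)
```

The hidden chain is time-homogeneous and Markovian, yet the observed two-step probabilities do not equal the square of the observed one-step matrix. In this toy setting, a Markov model estimated from the observed labels at one timescale would mispredict behavior at another, which is the kind of apparent time dependence the abstract attributes to aggregation.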
