Relative Evolutionary Rates in Proteins Are Largely Insensitive to the Substitution Model

Abstract

The relative evolutionary rates at individual sites in proteins are informative measures of conservation or adaptation. Often used as evolutionarily aware conservation scores, relative rates reveal key functional or strongly selected residues. Estimating rates in a phylogenetic context requires specifying a protein substitution model, which is typically a phenomenological model trained on a large empirical data set. A strong emphasis has traditionally been placed on selecting the "best-fit" model, with the implicit understanding that suboptimal or otherwise ill-fitting models might bias inferences. However, the pervasiveness and degree of such bias have not been systematically examined. We investigated how model choice affects site-wise relative rates in a large set of empirical protein alignments. We compared models designed for use on any general protein, models designed for specific domains of life, and the simple equal-rates Jukes-Cantor-style model (JC). As expected, information-theoretic measures showed overwhelming evidence that some models fit the data decidedly better than others. By contrast, estimates of site-specific evolutionary rates were largely insensitive to the substitution model used, revealing an unexpected degree of robustness to potential model misspecification. A closer examination of the fewer than 5% of sites for which model inferences differed in a meaningful way showed that the JC model could uniquely identify rapidly evolving sites that models with empirically derived exchangeabilities failed to detect. We conclude that relative protein rates appear robust to the applied substitution model, and that any sensible model of protein evolution, regardless of its fit to the data, should produce broadly consistent evolutionary rates.
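
The contrast drawn in the abstract, between strongly different model fit and nearly identical site-wise rates, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the log-likelihoods, parameter counts, and per-site rates are made-up numbers, the function names are hypothetical, dividing by the mean is only one common way to obtain relative rates, and Spearman rank correlation is used as a generic agreement measure rather than as the statistic the study necessarily reports.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood

def relative_rates(rates):
    """Scale rates so their mean is 1 (an assumed normalization convention)."""
    mean = sum(rates) / len(rates)
    return [r / mean for r in rates]

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical fits of two models to the same alignment (invented numbers).
aic_empirical = aic(log_likelihood=-1000.0, n_params=10)
aic_jc = aic(log_likelihood=-1050.0, n_params=10)
delta_aic = aic_jc - aic_empirical  # positive: the empirical model fits better

# Hypothetical per-site rates inferred under each model (invented numbers).
rates_empirical = relative_rates([0.1, 0.5, 1.2, 2.0, 0.3])
rates_jc = relative_rates([0.2, 0.4, 1.0, 2.5, 0.35])
rho = spearman(rates_empirical, rates_jc)

print(f"delta AIC = {delta_aic:.1f}, Spearman rho = {rho:.3f}")
```

In this toy case the two models differ clearly in AIC while ranking the sites identically, which is the qualitative pattern the abstract describes; real alignments would of course need rates inferred by a phylogenetic tool rather than typed in by hand.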
