CodonTest: Modeling Amino Acid Substitution Preferences in Coding Sequences

Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of the amino acids being exchanged. Recent developments have shown that models that allow amino acid pairs to have independent rates of substitution fit better than single-rate models. These approaches, however, have been limited by the need for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, with the subdivision dependent on the information content of the alignment. Given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies, and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus that gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence that this clustering will be useful for the evolutionary fingerprinting of genes.
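To make the search problem concrete, the following is a minimal Python sketch of a GA that assigns items (standing in for amino acid substitution pairs) to rate classes and picks the number of classes by a penalized score. It is an illustration under stated assumptions, not the method as implemented in the paper: the phylogenetic likelihood is replaced by a toy Gaussian fit to hypothetical per-pair rates (`RATES`), the BIC-style penalty is an assumption of this sketch, and all function names (`score`, `run_ga`, `select_classes`) are hypothetical.

```python
import math
import random

# Hypothetical per-pair rate estimates (toy data, three underlying groups).
RATES = [0.2, 0.25, 0.15, 0.2, 1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]


def score(assignment, rates, sigma=0.1):
    """BIC-style score (lower is better) for a partition into rate classes.

    Assumption: a Gaussian error model stands in for the phylogenetic
    likelihood; each class shares one rate, fitted as the class mean.
    """
    classes = {}
    for cls, rate in zip(assignment, rates):
        classes.setdefault(cls, []).append(rate)
    sse = 0.0
    for members in classes.values():
        mean = sum(members) / len(members)
        sse += sum((r - mean) ** 2 for r in members)
    log_lik = -sse / (2 * sigma ** 2)  # up to an additive constant
    # One parameter per non-empty class, penalized by log(sample size).
    return -2 * log_lik + len(classes) * math.log(len(rates))


def mutate(assignment, n_classes, rng, rate=0.05):
    """Reassign each item to a random class with a small probability."""
    return [rng.randrange(n_classes) if rng.random() < rate else c
            for c in assignment]


def crossover(a, b, rng):
    """Single-point crossover of two class-assignment vectors."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def run_ga(rates, n_classes, pop_size=40, generations=200, seed=0):
    """Search for a low-scoring assignment using at most n_classes classes."""
    rng = random.Random(seed)
    pop = [[rng.randrange(n_classes) for _ in rates] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: score(a, rates))
        elite = pop[: pop_size // 4]  # elitism: best quarter survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, rng), n_classes, rng))
        pop = elite + children
    best = min(pop, key=lambda a: score(a, rates))
    return best, score(best, rates)


def select_classes(rates, max_classes=4, seed=0):
    """Run the GA for 1..max_classes classes and keep the best-scoring model."""
    results = [run_ga(rates, k, seed=seed) for k in range(1, max_classes + 1)]
    return min(results, key=lambda r: r[1])


if __name__ == "__main__":
    assignment, best_score = select_classes(RATES)
    print("classes used:", len(set(assignment)), "score:", round(best_score, 2))
```

In a real analysis the call to `score` would be replaced by a full maximum likelihood fit of the codon model for each candidate assignment, which is what makes an efficient search strategy necessary.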

[1]  O. Gascuel,et al.  An improved general amino acid replacement matrix. , 2008, Molecular biology and evolution.

[2]  William R. Taylor,et al.  The rapid generation of mutation data matrices from protein sequences , 1992, Comput. Appl. Biosci..

[3]  S. Muse,et al.  Site-to-site variation of synonymous substitution rates. , 2005, Molecular biology and evolution.

[4]  O. Gascuel,et al.  A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. , 2003, Systematic biology.

[5]  Z. Yang,et al.  Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. , 1993, Molecular biology and evolution.

[6]  David C. Nickle,et al.  HIV-Specific Probabilistic Models of Protein Evolution , 2007, PloS one.

[7]  J. Huelsenbeck,et al.  Bayesian analysis of amino acid substitution models , 2008, Philosophical Transactions of the Royal Society B: Biological Sciences.

[8]  Maria Anisimova,et al.  Investigating protein-coding sequence evolution with probabilistic codon substitution models. , 2009, Molecular biology and evolution.

[9]  T. Tatusova,et al.  The Influenza Virus Resource at the National Center for Biotechnology Information , 2007, Journal of Virology.

[10]  P. Waddell,et al.  Plastid Genome Phylogeny and a Model of Amino Acid Substitution for Proteins Encoded by Chloroplast DNA , 2000, Journal of Molecular Evolution.

[11]  David Posada,et al.  MODELTEST: testing the model of DNA substitution , 1998, Bioinform..

[12]  S. Tavaré Some probabilistic and statistical problems in the analysis of DNA sequences , 1986 .

[13]  Wendy S. W. Wong,et al.  Identification of physicochemical selective pressure on protein encoding nucleotide sequences , 2006, BMC Bioinformatics.

[14]  Sergei L. Kosakovsky Pond,et al.  Not so different after all: a comparison of methods for detecting amino acid sites under selection. , 2005, Molecular biology and evolution.

[15]  Hervé Philippe,et al.  Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. , 2009, Molecular biology and evolution.

[16]  R. Nielsen,et al.  Detecting Site-Specific Physicochemical Selective Pressures: Applications to the Class I HLA of the Human Major Histocompatibility Complex and the SRK of the Plant Sporophytic Self-Incompatibility System , 2005, Journal of Molecular Evolution.

[17]  Richard A. Goldstein,et al.  rtREV: An Amino Acid Substitution Matrix for Inference of Retrovirus and Reverse Transcriptase Phylogeny , 2002, Journal of Molecular Evolution.

[18]  Peter F Stadler,et al.  Modeling amino acid substitution patterns in orthologous and paralogous genes. , 2007, Molecular phylogenetics and evolution.

[19]  M. O. Dayhoff,et al.  22 A Model of Evolutionary Change in Proteins , 1978 .

[20]  S. Whelan,et al.  A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. , 2001, Molecular biology and evolution.

[21]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[22]  Sergei L. Kosakovsky Pond,et al.  Benchmarking Multi-Rate Codon Models , 2010, PloS one.

[23]  Todd M. Allen,et al.  HIV evolution: CTL escape mutation and reversion after transmission , 2004, Nature Medicine.

[24]  Colin A. Russell,et al.  The Global Circulation of Seasonal Influenza A (H3N2) Viruses , 2008, Science.

[25]  Sergei L. Kosakovsky Pond,et al.  An Evolutionary Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and Subtype Prediction in HIV-1 , 2009, PLoS Comput. Biol..

[26]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[27]  D. Posada jModelTest: phylogenetic model averaging. , 2008, Molecular biology and evolution.

[28]  Sergei L. Kosakovsky Pond,et al.  An Evolutionary-Network Model Reveals Stratified Interactions in the V3 Loop of the HIV-1 Envelope , 2007, PLoS Comput. Biol..

[29]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[30]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[31]  N. Goldman,et al.  A codon-based model of nucleotide substitution for protein-coding DNA sequences. , 1994, Molecular biology and evolution.

[32]  M. Hasegawa,et al.  Model of amino acid substitution in proteins encoded by mitochondrial DNA , 1996, Journal of Molecular Evolution.

[33]  David R. Anderson,et al.  Model Selection and Multimodel Inference , 2003 .

[34]  Simon Whelan,et al.  Estimating the Frequency of Events That Cause Multiple-Nucleotide Changes , 2004, Genetics.

[35]  Ian Holmes,et al.  An empirical codon model for protein sequence evolution. , 2007, Molecular biology and evolution.

[36]  L. Stanfel,et al.  A new approach to clustering the amino acids. , 1996, Journal of theoretical biology.

[37]  Sergei L. Kosakovsky Pond,et al.  A genetic algorithm approach to detecting lineage-specific variation in selection pressure. , 2005, Molecular biology and evolution.

[38]  T. Pupko,et al.  A combined empirical and mechanistic codon model. , 2006, Molecular biology and evolution.

[39]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[40]  S. Muse,et al.  A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. , 1994, Molecular biology and evolution.

[41]  Huan Zhang,et al.  Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates , 2008, Proceedings of the National Academy of Sciences.

[42]  Sergei L. Kosakovsky Pond,et al.  Datamonkey: rapid detection of selective pressure on individual sites of codon alignments , 2005, Bioinform..

[43]  Konrad Scheffler,et al.  Evolutionary fingerprinting of genes. , 2010, Molecular biology and evolution.

[44]  Sergei L. Kosakovsky Pond,et al.  Evolutionary model selection with a genetic algorithm: a case study using stem RNA. , 2007, Molecular biology and evolution.

[45]  R. Shamir,et al.  A fast algorithm for joint reconstruction of ancestral amino acid sequences. , 2000, Molecular biology and evolution.

[46]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[47]  Konrad Scheffler,et al.  Models of coding sequence evolution , 2008, Briefings Bioinform..

[48]  Sergei L. Kosakovsky Pond,et al.  HyPhy: hypothesis testing using phylogenies , 2005, Bioinform..

[49]  David Heckerman,et al.  Evidence of Differential HLA Class I-Mediated Viral Evolution in Functional and Accessory/Regulatory Genes of HIV-1 , 2007, PLoS pathogens.

[50]  Nicolas Rodriguez,et al.  PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees , 2005, Nucleic Acids Res..

[51]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[52]  Peter F Stadler,et al.  Solvent exposure imparts similar selective pressures across a range of yeast proteins. , 2009, Molecular biology and evolution.

[53]  Tanmoy Bhattacharya,et al.  HLA Class I-Driven Evolution of Human Immunodeficiency Virus Type 1 Subtype C Proteome: Immune Escape and Viral Load , 2008, Journal of Virology.

[54]  David Posada,et al.  Automated phylogenetic detection of recombination using a genetic algorithm. , 2006, Molecular biology and evolution.

[55]  A. Atkinson A note on the generalized information criterion for choice of a model , 1980 .

[56]  N. Goldman,et al.  Codon-substitution models for heterogeneous selection pressure at amino acid sites. , 2000, Genetics.