An improved general amino acid replacement matrix.

Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along the branches of a phylogeny and thus the likelihood of the data, and they are also essential in protein alignment. A number of replacement matrices, and methods to estimate them from protein alignments, have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) with their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and by using a much larger and more diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the performance of LG, we use an independent sample of 59 alignments from TreeBase, and we randomly divide the Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42 when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 of the 59 alignments and significantly worse for only 2; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG affects not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
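As an illustration of how a replacement matrix yields substitution probabilities along a branch, and from them a likelihood, the following minimal Python sketch may help. It is not the LG estimation procedure or the PHYML implementation: it assumes a hypothetical 3-state toy alphabet instead of the 20 amino acids, made-up exchangeabilities (`EXCH`) and frequencies (`FREQS`), and a plain truncated Taylor series with scaling and squaring for the matrix exponential. All function names are invented for this example.

```python
import math

def rate_matrix(exch, freqs):
    """Reversible rate matrix with q_ij = r_ij * pi_j for i != j,
    rescaled so that one time unit equals one expected replacement per site."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so rows sum to zero
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t): Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(q)
    step = t / (2 ** squarings)
    a = [[x * step for x in row] for row in q]
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in p]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p

def pair_log_likelihood(seq_a, seq_b, freqs, p):
    """Log-likelihood of two aligned sequences (lists of state indices)
    separated by a branch with transition matrix p, sites independent."""
    return sum(math.log(freqs[x] * p[x][y]) for x, y in zip(seq_a, seq_b))

# Hypothetical toy parameters, not values from LG, WAG, or JTT.
FREQS = [0.5, 0.3, 0.2]
EXCH = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 0.5],
        [2.0, 0.5, 0.0]]

Q = rate_matrix(EXCH, FREQS)
P = transition_probs(Q, 0.5)  # branch length of 0.5 expected replacements per site
print(pair_log_likelihood([0, 1, 2], [0, 1, 0], FREQS, P))
```

In a real analysis the same quantities would be computed for 20 states on a full tree, typically with rate variation across sites, by software such as PHYML; the sketch only shows the single-branch building block.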

[1]  W. Mac Stewart A Note on the Power of the Sign Test , 1941 .

[2]  M. O. Dayhoff,et al.  Atlas of protein sequence and structure , 1965 .

[3]  H. Akaike A new look at the statistical model identification , 1974 .

[4]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[5]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[6]  M. O. Dayhoff,et al.  A Model of Evolutionary Change in Proteins , 1978 .

[7]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[8]  N. Saitou The neighbor-joining method: A new method for reconstructing phylogenetic trees , 1987 .

[9]  William R. Taylor,et al.  The rapid generation of mutation data matrices from protein sequences , 1992, Comput. Appl. Biosci..

[10]  Z. Yang,et al.  Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. , 1993, Molecular biology and evolution.

[11]  S A Benner,et al.  Amino acid substitution during functionally constrained divergent evolution of protein sequences. , 1994, Protein engineering.

[12]  David C. Jones,et al.  A mutation data matrix for transmembrane proteins , 1994, FEBS letters.

[13]  W. Li,et al.  Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. , 1995, Molecular biology and evolution.

[14]  R A Goldstein,et al.  Context-dependent optimal substitution matrices. , 1995, Protein engineering.

[15]  David C. Jones,et al.  Combining protein evolution and secondary structure. , 1996, Molecular biology and evolution.

[16]  João Meidanis,et al.  Introduction to computational molecular biology , 1997 .

[17]  Andrew Rambaut,et al.  Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees , 1997, Comput. Appl. Biosci..

[18]  Lars Arvestad,et al.  Estimation of Reversible Substitution Matrices from Multiple Pairs of Sequences , 1997, Journal of Molecular Evolution.

[19]  David C. Jones,et al.  Assessing the impact of secondary structure and solvent accessibility on protein evolution. , 1998, Genetics.

[20]  Z. Yang,et al.  Models of amino acid substitution and applications to mitochondrial protein evolution. , 1998, Molecular biology and evolution.

[21]  J. Thorne,et al.  Models of protein sequence evolution and their applications. , 2000, Current opinion in genetics & development.

[22]  P. Waddell,et al.  Plastid Genome Phylogeny and a Model of Amino Acid Substitution for Proteins Encoded by Chloroplast DNA , 2000, Journal of Molecular Evolution.

[23]  Wei Qian,et al.  Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. , 2000, Molecular biology and evolution.

[24]  Martin Vingron,et al.  Modeling Amino Acid Replacement , 2000, J. Comput. Biol..

[25]  A. Rodrigo,et al.  Likelihood-based tests of topologies in phylogenetics. , 2000, Systematic biology.

[26]  S. Whelan,et al.  A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. , 2001, Molecular biology and evolution.

[27]  Richard A. Goldstein,et al.  rtREV: An Amino Acid Substitution Matrix for Inference of Retrovirus and Reverse Transcriptase Phylogeny , 2002, Journal of Molecular Evolution.

[28]  I Holmes,et al.  An expectation maximization algorithm for training hidden substitution models. , 2002, Journal of molecular biology.

[29]  Bruno Torrésani,et al.  Rate Matrices for Analyzing Large Families of Protein Sequences , 2002, J. Comput. Biol..

[30]  M. Barnes,et al.  Bioinformatics for geneticists. , 2003 .

[31]  R. Russell,et al.  Amino‐Acid Properties and Consequences of Substitutions , 2003 .

[32]  Andrew D. Smith,et al.  A Transition Probability Model for Amino Acid Substitutions from Blocks , 2003, J. Comput. Biol..

[33]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[34]  O. Gascuel,et al.  A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. , 2003, Systematic biology.

[35]  Lars Arvestad,et al.  Efficient Methods for Estimating Amino Acid Replacement Rates , 2006, Journal of Molecular Evolution.

[36]  H. Philippe,et al.  A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. , 2004, Molecular biology and evolution.

[37]  D. Posada,et al.  Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. , 2004, Systematic biology.

[38]  N. Goldman,et al.  Different versions of the Dayhoff rate matrix. , 2005, Molecular biology and evolution.

[39]  Olivier Gascuel,et al.  Mathematics of Evolution and Phylogeny , 2005 .

[40]  Thomas J Naughton,et al.  Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified , 2006, BMC Evolutionary Biology.

[41]  Olivier Gascuel,et al.  Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood , 2005, Bioinform..

[42]  H. Kishino,et al.  Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea , 1989, Journal of Molecular Evolution.

[43]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 1981, Journal of Molecular Evolution.

[44]  O. Gascuel Mathematics of Evolution & Phylogeny , 2005 .

[45]  M. Hasegawa,et al.  Model of amino acid substitution in proteins encoded by mitochondrial DNA , 1996, Journal of Molecular Evolution.

[46]  Ian Holmes,et al.  XRate: a fast prototyping, training and annotation tool for phylo-grammars , 2006, BMC Bioinformatics.

[47]  Ziheng Yang,et al.  Computational Molecular Evolution , 2006 .

[48]  David Posada,et al.  MtArt: a new model of amino acid replacement for Arthropoda. , 2006, Molecular biology and evolution.

[49]  Ian Holmes,et al.  An empirical codon model for protein sequence evolution. , 2007, Molecular biology and evolution.

[50]  Andrew Meade,et al.  Mixture models in phylogenetic inference , 2007, Mathematics of Evolution and Phylogeny.

[51]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[52]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..