Efficient Methods for Estimating Amino Acid Replacement Rates

Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied from an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adopting a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies the Müller-Vingron resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multiple-alignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.
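To make the central object concrete, the following is a minimal sketch of how a replacement rate matrix Q yields substitution probabilities P(t) = exp(Qt) over an evolutionary distance t. It does not implement the estimation methods of the paper or the resolvent method. The three-letter alphabet, the values in `pi` and `s`, the helper names, and the reversible parameterization Q_ij = s_ij * pi_j are all assumptions chosen for illustration; a real amino acid matrix would be 20 by 20.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reversible_rate_matrix(s, pi):
    """Build Q with Q_ij = s_ij * pi_j for i != j (assumed convention).

    s is a symmetric exchangeability matrix and pi the equilibrium
    frequencies. The diagonal is set so that every row sums to zero.
    """
    n = len(pi)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the sum is taken
    return q


def mat_exp(q, t, terms=20, squarings=10):
    """Approximate exp(q * t) by a truncated Taylor series with scaling
    and squaring. Adequate for small toy matrices, not production use."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = mat_mul(term, a)
        term = [[x / k for x in row] for row in term]  # now a^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling step by step
    return result


# Hypothetical toy values for a three-letter alphabet.
pi = [0.5, 0.3, 0.2]
s = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.5],
     [2.0, 0.5, 0.0]]

q = reversible_rate_matrix(s, pi)
p = mat_exp(q, 0.5)  # substitution probabilities at distance t = 0.5
for row in p:
    print(["%.4f" % x for x in row])
```

Each row of `p` is a probability distribution over replacement outcomes for one starting residue, and for large `t` every row approaches `pi`. Estimating a data-specific matrix amounts to choosing Q so that such probabilities fit the observed aligned residue pairs.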
