Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families

This paper introduces a method for identifying empirically conserved amino acid substitution groups. In contrast with existing approaches that view amino acid substitution as a pairwise phenomenon, the method presented here identifies conserved groups of amino acids using a data structure called a conditional distribution matrix. The conditional distribution matrix extends the concept of a pairwise substitution matrix by changing the context of substitution from a single amino acid to a group of amino acids. The matrix tabulates information from a database of protein families that contains numerous aligned positions. Each row in the matrix contains the distribution of amino acids in those aligned positions that contain a given conditioning group of amino acids. The method converts a database of protein families into a conditional distribution matrix and then examines each possible substitution group for evidence of conservation. The algorithm is applied to the BLOCKS and HSSP databases. Twenty amino acid substitution groups are found to be conserved empirically in both databases. These groups provide insight into biochemical properties that are conserved in protein evolution.
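The conditional distribution matrix described above can be illustrated with a minimal sketch. It assumes that an aligned position "contains" a conditioning group when every amino acid in the group occurs at least once in that position, and it represents each aligned position as a plain string of residues. The function names, the `max_group_size` parameter, and the toy columns are hypothetical and are not taken from the paper. The test for conservation of a group is not specified in the abstract and is omitted here.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def conditional_distribution(columns, group):
    """Return one matrix row: the amino acid distribution over all aligned
    positions that contain every member of `group` (assumed definition)."""
    group = set(group)
    counts = Counter()
    for col in columns:
        # Ignore gaps and any non-standard symbols.
        residues = [r for r in col if r in AMINO_ACIDS]
        if group <= set(residues):
            counts.update(residues)
    total = sum(counts.values())
    if total == 0:
        return {}  # no aligned position contains this group
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def conditional_distribution_matrix(columns, max_group_size=2):
    """Build rows for every conditioning group of up to `max_group_size`
    amino acids that is observed in at least one aligned position."""
    observed = sorted({r for col in columns for r in col if r in AMINO_ACIDS})
    matrix = {}
    for size in range(1, max_group_size + 1):
        for group in combinations(observed, size):
            row = conditional_distribution(columns, group)
            if row:
                matrix[frozenset(group)] = row
    return matrix


# Toy aligned positions (hypothetical data, one string per alignment column).
columns = ["IIVV", "ILVL", "DDEE", "KKRR"]
matrix = conditional_distribution_matrix(columns)
row_iv = matrix[frozenset("IV")]
```

In this toy input, the row conditioned on the group {I, V} pools the two positions that contain both residues, so any mass on L in that row suggests L as a candidate for extending the group. Enumerating all groups grows combinatorially, which is why the sketch caps the group size.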
