Constructing amino acid residue substitution classes maximally indicative of local protein structure

Using an information‐theoretic formalism, we optimize classes of amino acid substitution to be maximally indicative of local protein structure. Our statistically derived classes are loosely identifiable with the heuristic constructions found in previously published work. However, while these other methods provide a more rigid idealization of physicochemically constrained residue substitution, our classes provide substantially more structural information with far fewer parameters. Moreover, these substitution classes are consistent with the paradigmatic view of the sequence‐to‐structure relationship in globular proteins, which holds that the three‐dimensional architecture is predominantly determined by the arrangement of hydrophobic and polar side chains, with only weak constraints on the actual amino acid identities. More specific constraints are imposed on the placement of prolines, glycines, and the charged residues. These substitution classes have been used in highly accurate predictions of residue solvent accessibility. They could also be used in the identification of homologous proteins, in the construction and refinement of multiple sequence alignments, and as a means of condensing and codifying the information in multiple sequence alignments for secondary structure prediction and tertiary fold recognition. © 1996 Wiley‐Liss, Inc.
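
As a rough illustration of what "maximally indicative" can mean in information-theoretic terms, the sketch below scores a fixed partition of residues into classes by the mutual information between class label and a structural state. It is an assumption-laden toy, not the authors' procedure: the function name `mutual_information`, the two-state "buried"/"exposed" labeling, the four-residue alphabet, and all counts are hypothetical and chosen only to make the example run. The abstract does not describe how the classes are optimized, so the sketch only evaluates the objective for given partitions.

```python
import math
from collections import defaultdict


def mutual_information(counts, classes):
    """Return I(class; state) in bits.

    counts  : dict mapping (residue, structural_state) -> observed count
    classes : dict mapping residue -> class label (the partition being scored)
    """
    # Pool residue-level counts into class-level joint counts.
    joint = defaultdict(float)
    for (residue, state), n in counts.items():
        joint[(classes[residue], state)] += n

    total = sum(joint.values())

    # Marginal probabilities of class labels and structural states.
    p_class = defaultdict(float)
    p_state = defaultdict(float)
    for (label, state), n in joint.items():
        p_class[label] += n / total
        p_state[state] += n / total

    # Sum p(c, s) * log2( p(c, s) / (p(c) p(s)) ) over non-empty cells.
    mi = 0.0
    for (label, state), n in joint.items():
        if n > 0:
            p = n / total
            mi += p * math.log2(p / (p_class[label] * p_state[state]))
    return mi


# Hypothetical toy counts (NOT data from the paper): how often each residue
# is seen buried or exposed in some imagined set of structures.
toy_counts = {
    ("L", "buried"): 40, ("L", "exposed"): 10,
    ("V", "buried"): 35, ("V", "exposed"): 15,
    ("K", "buried"): 5,  ("K", "exposed"): 45,
    ("E", "buried"): 10, ("E", "exposed"): 40,
}

# Three candidate partitions of the same four residues.
hydrophobic_polar = {"L": "h", "V": "h", "K": "p", "E": "p"}
mixed = {"L": "a", "K": "a", "V": "b", "E": "b"}
identity = {r: r for r in "LVKE"}  # every residue in its own class

mi_hp = mutual_information(toy_counts, hydrophobic_polar)
mi_mixed = mutual_information(toy_counts, mixed)
mi_identity = mutual_information(toy_counts, identity)

print(f"hydrophobic/polar: {mi_hp:.4f} bits")
print(f"mixed:             {mi_mixed:.4f} bits")
print(f"identity:          {mi_identity:.4f} bits")
```

On these toy counts the hydrophobic/polar grouping retains most of the information available from full residue identity while using half as many classes, whereas the mixed grouping retains essentially none. Merging classes can never increase the mutual information, so any search over partitions trades the number of classes against the information retained.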
