Optimally informative backbone structural propensities in proteins

We use basic ideas from information theory to extract the maximum amount of structural information available in protein sequence data. From a non‐redundant set of protein X‐ray structures, we construct local‐sequence‐dependent [ϕ,ψ] distributions that summarize the influence of local sequence on backbone conformation. These distributions, which approximate the actual backbone propensities in the folded protein, have the following properties: (1) they compensate for scarce data through an optimized combination of local‐sequence‐dependent and single‐residue‐specific distributions; (2) they use multi‐residue information; (3) they exploit similarities in the local coding properties of amino acids by collapsing the amino acid alphabet, which streamlines the description of local sequence; (4) they are designed to contain the maximum amount of local structural information the data set allows. Our methodology extracts about 30 cnats of information from the protein data set, out of a total of 387 cnats of initial uncertainty (entropy) in a finely discretized [ϕ,ψ] dihedral angle space (18 × 18 structural states), or about 7.8%. This gain was achieved at the hexamer length scale; both shorter and longer fragments yield smaller information gains. The automatic clustering of amino acids into groups, a component of the optimization procedure, reveals patterns consistent with their local coding properties. Although the overall information gain from local sequence is small, some local sequences have significantly narrower structural distributions than others. Distribution width, measured as entropy, ranges from at least 20% below the average overall entropy to at least 14% above it. This spread expresses the influence of local sequence on the conformational propensities of the backbone chain.
The optimal ensemble of local‐sequence‐specific backbone distributions produced is useful as a guide to structural predictions from sequence, as well as a tool for further explorations of the nature of the local protein code. Proteins 2002;48:463–486. © 2002 Wiley‐Liss, Inc.
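The entropy and information-gain figures quoted above can be illustrated with a minimal sketch. It is not the authors' optimization procedure: it uses a plain plug-in (frequency-count) estimate, which overstates the gain when data are scarce, and it omits the mixing with single-residue distributions and the alphabet clustering described in the abstract. The function names, the 20° bin layout, and the reading of "cnat" as one hundredth of a nat are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

N_BINS = 18  # assumed layout: 18 x 18 grid of 20-degree bins over [-180, 180)


def bin_index(phi, psi, n_bins=N_BINS):
    """Map a (phi, psi) pair in degrees to a single grid-cell index."""
    width = 360.0 / n_bins
    i = int((phi + 180.0) // width) % n_bins
    j = int((psi + 180.0) // width) % n_bins
    return i * n_bins + j


def entropy_cnats(counts):
    """Plug-in Shannon entropy of a histogram, in cnats (assumed: 0.01 nat)."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return 100.0 * h


def information_gain_cnats(samples):
    """Return (prior entropy, information gain) in cnats.

    samples: iterable of (context, phi, psi), where `context` is any
    hashable label for the local sequence (hypothetical encoding).
    Gain = H(state) - sum over contexts of p(context) * H(state | context).
    """
    overall = Counter()
    by_context = defaultdict(Counter)
    for context, phi, psi in samples:
        state = bin_index(phi, psi)
        overall[state] += 1
        by_context[context][state] += 1
    n = sum(overall.values())
    h_prior = entropy_cnats(overall)
    h_cond = sum(
        sum(c.values()) / n * entropy_cnats(c) for c in by_context.values()
    )
    return h_prior, h_prior - h_cond
```

Calling `information_gain_cnats` on a list of `(context, phi, psi)` tuples returns the prior entropy and the gain; dividing the gain by the prior entropy gives the fractional gain, which for the values reported in the abstract is 30/387, or roughly 7.8%.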
