Local descriptors of protein structure: A systematic analysis of the sequence‐structure relationship in proteins using short‐ and long‐range interactions

Local protein structure representations that incorporate long‐range contacts between residues are often considered in protein structure comparison but have found relatively little use in structure prediction where assembly from single backbone fragments dominates. Here, we introduce the concept of local descriptors of protein structure to characterize local neighborhoods of amino acids including short‐ and long‐range interactions. We build a library of recurring local descriptors and show that this library is general enough to allow assembly of unseen protein structures. The library could on average re‐assemble 83% of 119 unseen structures, and showed little or no performance decrease between homologous targets and targets with folds not represented among domains used to build it. We then systematically evaluate the descriptor library to establish the level of the sequence signal in sets of protein fragments of similar geometrical conformation. In particular, we test whether that signal is strong enough to facilitate correct assignment and alignment of these local geometries to new sequences. We use the signal to assign descriptors to a test set of 479 sequences with less than 40% sequence identity to any domain used to build the library, and show that on average more than 50% of the backbone fragments constituting descriptors can be correctly aligned. We also use the assigned descriptors to infer SCOP folds, and show that correct predictions can be made in many of the 151 cases where PSI‐BLAST was unable to detect significant sequence similarity to proteins in the library. Although the combinatorial problem of simultaneously aligning several fragments to sequence is a major bottleneck compared with single fragment methods, the advantage of the current approach is that correct alignments imply correct long‐range distance constraints. The lack of these constraints is most likely the major reason why structure prediction methods fail to consistently produce adequate models when good templates are unavailable or undetectable. Thus, we believe that the current study offers new and valuable insight into the prediction of sequence‐structure relationships in proteins. Proteins 2009. © 2008 Wiley‐Liss, Inc.
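
To make the idea of a local descriptor concrete, the sketch below collects the backbone fragments that surround one central residue, including fragments that are far away in sequence but close in space. It is only an illustration under stated assumptions, not the procedure used in the paper: the contact definition (Cα–Cα distance), the 6.5 Å cutoff, the two-residue flank, and the names `local_descriptor` and `toy_hairpin` are all hypothetical choices made for this example.

```python
import math


def local_descriptor(ca_coords, center, cutoff=6.5, flank=2):
    """Return the backbone fragments around `center` as (start, end) index ranges.

    Assumptions for this sketch (not taken from the paper): contacts are
    defined by C-alpha distance, `cutoff` is in angstroms, and every contact
    residue is extended by `flank` residues on each side.
    """
    n = len(ca_coords)
    # Residues in spatial contact with the central residue (the center itself
    # is always included because its distance is zero).
    contacts = [
        i for i in range(n)
        if math.dist(ca_coords[i], ca_coords[center]) <= cutoff
    ]
    # Extend each contact residue into a short backbone fragment.
    segments = sorted(
        (max(0, i - flank), min(n - 1, i + flank)) for i in contacts
    )
    # Merge overlapping or directly adjacent fragments.
    merged = []
    for start, end in segments:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def toy_hairpin():
    """Idealized 12-residue hairpin: two antiparallel strands 5 angstroms apart."""
    forward = [(3.8 * i, 0.0, 0.0) for i in range(6)]
    backward = [(3.8 * (11 - i), 5.0, 0.0) for i in range(6, 12)]
    return forward + backward


if __name__ == "__main__":
    # Residue 0 touches its sequence neighbor and the end of the second strand.
    print(local_descriptor(toy_hairpin(), center=0))  # [(0, 3), (8, 11)]
```

In the toy hairpin, the descriptor centered on residue 0 consists of two fragments: residues 0 to 3 (short‐range) and residues 8 to 11 (long‐range). Aligning both fragments to a sequence at once is what would fix a long‐range distance constraint, which is the advantage the abstract attributes to descriptors over single fragments.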