A hidden Markov model derived structural alphabet for proteins.

Understanding and predicting protein structures depends on the complexity and accuracy of the models used to represent them. We have set up a hidden Markov model that discretizes protein backbone conformation as a series of overlapping fragments (states), each four residues long. This approach learns the geometry of the states and their connections simultaneously. Using a statistical criterion, we obtain an optimal systematic decomposition of the conformational variability of the protein peptide chain into 27 states with strong connection logic. This result is stable across different protein sets. Our model agrees well with previous knowledge of protein architecture and appears able to capture some subtle details of protein organisation, such as sub-level organisation schemes within helices. Taking the dependence between states into account yields a low-complexity description of local protein structure: on average, the model uses only 8.3 of the 27 states to describe each position of a protein structure. Although we use short fragments, learning on entire protein conformations captures the logic of assembly on a larger scale. With such a model, protein structures can be reconstructed with an average accuracy close to 1.1 Å root-mean-square deviation at a complexity of only 3. Finally, we observe that sequence specificity increases with the number of states of the structural alphabet. Such models can constitute a highly relevant approach to the analysis of protein architecture, in particular for protein structure prediction.
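To make the idea of overlapping four-residue fragments concrete, the following minimal Python sketch cuts a chain of backbone coordinates into overlapping windows. It is an illustration under stated assumptions, not the procedure used in this work: the function names `overlapping_fragments` and `fragment_distances` are hypothetical, one coordinate per residue (for example the alpha carbon) is assumed, and describing a fragment by its three non-adjacent inter-residue distances is an assumed choice of descriptor. The hidden Markov model itself is not implemented here.

```python
import numpy as np

def overlapping_fragments(coords, length=4):
    """Return all overlapping windows of `length` consecutive residues.

    `coords` is an (N, 3) array with one point per residue (assumed here
    to be alpha-carbon positions). A chain of N residues yields
    N - length + 1 fragments; consecutive fragments share length - 1 residues.
    """
    xyz = np.asarray(coords, dtype=float)
    if xyz.ndim != 2 or xyz.shape[1] != 3:
        raise ValueError("expected an (N, 3) array of coordinates")
    n = xyz.shape[0]
    if n < length:
        return np.empty((0, length, 3))
    return np.stack([xyz[i:i + length] for i in range(n - length + 1)])

def fragment_distances(fragment):
    """Hypothetical descriptor for a four-residue fragment: the distances
    between residues (1,3), (1,4) and (2,4), i.e. the non-adjacent pairs."""
    pairs = [(0, 2), (0, 3), (1, 3)]
    return np.array([np.linalg.norm(fragment[i] - fragment[j]) for i, j in pairs])

# Toy chain: six points on a straight line, 3.8 Å apart (not a real protein).
chain = np.array([[3.8 * i, 0.0, 0.0] for i in range(6)])
frags = overlapping_fragments(chain)
print(frags.shape)  # (3, 4, 3): six residues give three overlapping fragments
print(fragment_distances(frags[0]))
```

In a model of the kind described above, each such fragment would be assigned to one of the states, and the overlap between consecutive fragments is what makes the transitions between states informative.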
