Prediction of Beta-Sheet Structures Using Stochastic Tree Grammars

We empirically demonstrate the effectiveness of a method for predicting protein secondary structures, β-sheet regions in particular, using a class of stochastic tree grammars as a representational language for their amino acid sequence patterns. The family of stochastic tree grammars we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the rare families of stochastic grammars that are expressive enough to capture the kind of long-distance dependencies exhibited by the sequences of β-sheet regions and that, at the same time, enjoy relatively efficient processing. We applied our method to real data obtained from the HSSP database, and the results are encouraging: using an SRNRG trained on data from a particular protein, our method was able to predict the location and structure of β-sheet regions in a number of different proteins whose sequences are less than 25 per cent homologous to the training sequences. The learning algorithm we use is an extension of the 'Inside-Outside' algorithm for stochastic context-free grammars, but with a number of significant modifications. First, we restricted the grammars used to members of the 'linear' subclass of SRNRG, and devised simpler and faster algorithms for this subclass. Second, we reduced the alphabet size (i.e. the number of amino acids) by clustering the amino acids according to their physico-chemical properties, gradually through the iterations of the learning algorithm. Our experiments indicate that our prediction method goes beyond what is possible by alignment alone, and that the grammar acquired by our learning algorithm captures the type of long-distance dependencies that could not be succinctly expressed by an HMM. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with previous inverse protein folding methods.

1 Hiroshi Mamitsuka, Naoki Abe: RWCP Theory NEC Laboratory, c/o NEC C&C Research Laboratories, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, Japan.
2 Real World Computing Partnership.
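The alphabet reduction mentioned in the abstract can be pictured with a minimal Python sketch. It is not the authors' procedure: the paper clusters amino acids gradually during learning, whereas the six groups below are a fixed, commonly used textbook-style partition chosen here only for illustration. The names `GROUPS`, `AA_TO_GROUP`, and `reduce_alphabet` are hypothetical.

```python
# Illustrative, fixed grouping of the 20 amino acids by physico-chemical
# property. ASSUMPTION: these groups are not taken from the paper, which
# derives its clusters gradually during training.
GROUPS = {
    "a": "AVLIM",  # aliphatic / hydrophobic
    "r": "FWY",    # aromatic
    "p": "STNQ",   # polar, uncharged
    "+": "KRH",    # positively charged
    "-": "DE",     # negatively charged
    "s": "CGP",    # special cases (cysteine, glycine, proline)
}

# Invert the table: one-letter amino acid code -> group symbol.
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_alphabet(sequence):
    """Rewrite a protein sequence over the reduced 6-symbol alphabet.

    Residues outside the 20 standard codes (e.g. 'X') map to '?'.
    """
    return "".join(AA_TO_GROUP.get(aa, "?") for aa in sequence.upper())


if __name__ == "__main__":
    print(reduce_alphabet("AVKDE"))
```

With a grouping like this, each emission distribution in a stochastic grammar needs 6 parameters instead of 20, which is the kind of saving that makes training on a small number of sequences more tractable.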
