A New Method for Predicting Protein Secondary Structures Based on Stochastic Tree Grammars

Abstract. We propose a new method for predicting the protein secondary structure of a given amino acid sequence, based on a training algorithm for the probability parameters of a certain class of stochastic tree grammars. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, which we call Stochastic Ranked Node Rewriting Grammars (SRNRG). These grammars are powerful enough to capture the types of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. Our learning algorithm is an adaptation of the ‘Inside-Outside’ algorithm (for stochastic context-free grammars) to SRNRG, with two significant modifications. First, by placing a restriction on the form of SRNRG, we obtain a simpler and faster learning algorithm. Second, the algorithm is equipped with a new iterative method for reducing the alphabet size (i.e. the number of distinct amino acids) by clustering amino acids according to their physico-chemical properties. Our preliminary experiments indicate that our method is able to capture and generalize the kind of long-distance dependencies exhibited by β-sheets, which was not previously possible. In particular, our method was able to predict the β-sheet regions of a protein that is less than 25 per cent homologous to the sequences in the training data.
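To make the starting point of the learning algorithm concrete, the following is a minimal sketch of the inside computation for an ordinary stochastic context-free grammar in Chomsky normal form, which is the first half of the standard Inside-Outside algorithm. It is not the SRNRG adaptation described in the abstract. The toy grammar, its probabilities, the two-letter alphabet, and all names (`inside_probability`, `BINARY_RULES`, `LEXICAL_RULES`) are assumptions chosen for illustration; the grammar generates even-length palindromes, a simple instance of nested, anti-parallel-style pairing.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not taken from the paper).
# It generates even-length palindromes over {a, b}.
# Binary rules: (lhs, left_child, right_child) -> probability
BINARY_RULES = {
    ("S", "A", "X"): 0.3,
    ("S", "B", "Y"): 0.3,
    ("S", "A", "A"): 0.2,
    ("S", "B", "B"): 0.2,
    ("X", "S", "A"): 1.0,
    ("Y", "S", "B"): 1.0,
}
# Lexical rules: (lhs, terminal) -> probability
LEXICAL_RULES = {
    ("A", "a"): 1.0,
    ("B", "b"): 1.0,
}


def inside_probability(sequence, binary_rules, lexical_rules, start="S"):
    """Return the total probability that `start` derives `sequence`."""
    n = len(sequence)
    if n == 0:
        return 0.0

    # inside[(i, j)][N] = probability that nonterminal N derives sequence[i:j]
    inside = defaultdict(lambda: defaultdict(float))

    # Base case: spans of length 1 are covered by lexical rules.
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), prob in lexical_rules.items():
            if terminal == symbol:
                inside[(i, i + 1)][lhs] += prob

    # Recursive case: combine two adjacent shorter spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left, right = inside[(i, k)], inside[(k, j)]
                if not left or not right:
                    continue
                for (lhs, b, c), prob in binary_rules.items():
                    pb, pc = left.get(b, 0.0), right.get(c, 0.0)
                    if pb and pc:
                        inside[(i, j)][lhs] += prob * pb * pc

    return inside[(0, n)].get(start, 0.0)


if __name__ == "__main__":
    for seq in ("abba", "abab", "aa"):
        print(seq, inside_probability(seq, BINARY_RULES, LEXICAL_RULES))
```

In this sketch a non-palindromic string such as "abab" receives probability zero, because the first and last symbols of every span derived from S are generated together. A context-free grammar can express this nested pairing but not the crossing, parallel-style dependencies mentioned in the abstract, which is the motivation given there for moving to the more powerful tree grammars.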
