Paper Information - A New Method for Predicting Protein Secondary Structures Based on Stochastic Tree Grammars

Abstract: We propose a new method for predicting the secondary structure of a protein from its amino acid sequence, based on a training algorithm for the probability parameters of a certain type of stochastic tree grammar. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, called Stochastic Ranked Node Rewriting Grammars (SRNRG), which are powerful enough to capture the types of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. Our learning algorithm is an adaptation of the ‘Inside-Outside’ algorithm (for stochastic context-free grammars) to SRNRG, with two significant modifications. First, by placing a restriction on the form of SRNRG, we obtain a simpler and faster learning algorithm. Second, the algorithm is equipped with a new iterative way of reducing the alphabet size (i.e., the number of amino acids) by clustering the amino acids according to their physico-chemical properties. Our preliminary experiments indicate that our method is able to capture and generalize the kind of long-distance dependencies exhibited by β-sheets, which was not previously possible. Our method was able to predict the β-sheet regions of a protein that is less than 25 per cent homologous to the sequences in the training data.
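
The difference between the ‘parallel’ and ‘anti-parallel’ dependencies mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, and it assumes the simplest possible case of two strands of equal length n placed back to back with no loop between them. Under that assumption, anti-parallel pairing produces nested dependencies (the pattern a context-free grammar can express), while parallel pairing produces crossing dependencies (the pattern of the copy language, which is not context-free).

```python
# Illustrative sketch only; names and the "two adjacent strands of equal
# length, no loop" layout are assumptions, not the paper's formulation.

def antiparallel_pairs(n):
    """Strand 1 occupies positions 0..n-1, strand 2 occupies n..2n-1.
    Anti-parallel: residue i pairs with the mirrored residue 2n-1-i."""
    return [(i, 2 * n - 1 - i) for i in range(n)]

def parallel_pairs(n):
    """Parallel: residue i pairs with residue n+i (same direction)."""
    return [(i, n + i) for i in range(n)]

def crosses(p, q):
    """True if the two pairs interleave, i.e. a < c < b < d."""
    (a, b), (c, d) = sorted([p, q])
    return a < c < b < d

def has_crossing(pairs):
    """True if any two pairs in the list cross each other."""
    return any(
        crosses(p, q)
        for i, p in enumerate(pairs)
        for q in pairs[i + 1:]
    )

if __name__ == "__main__":
    print(antiparallel_pairs(3), has_crossing(antiparallel_pairs(3)))
    print(parallel_pairs(3), has_crossing(parallel_pairs(3)))
```

In this toy layout the anti-parallel pairs are all nested and the parallel pairs all cross, which is one way to see why a formalism stronger than a stochastic context-free grammar is needed once parallel strands and their combinations must be modeled.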

Naoki Abe | Hiroshi Mamitsuka
