Formal grammars for describing RNA pseudoknotted structure and their application to structure analysis

Recently, much attention has been paid to the structural analysis of biologically important molecules such as nucleic acids and proteins. These structures are classified hierarchically into primary, secondary and tertiary structure. In this thesis, we focus on RNA (ribonucleic acid) secondary structure, which is determined by interactions between base pairs that are mostly Watson-Crick complementary. Since base pairs in typical RNAs occur in a nested way, RNA secondary structures have been successfully modeled by context-free grammars (CFGs), and secondary structure prediction has been translated into a parsing problem. On the other hand, there are substructures called pseudoknots, in which some base pairs cross one another and which therefore cannot be represented by CFGs. For this reason, several formal grammars have been proposed for describing RNA secondary structure including pseudoknots, such as simple linear tree adjoining grammars (SLTAGs), extended SLTAGs (ESLTAGs) and RNA pseudoknot grammars (RPGs). However, the relations among the generative powers of these grammars have not been clarified so far. The first aim of this thesis is to compare the generative power of the grammars mentioned above by identifying them as subclasses of multiple context-free grammars (MCFGs), which are a natural extension of CFGs. More specifically, the following properties are shown: (1) the class of languages generated by RPGs agrees with the class of languages generated by MCFGs with dimension one or two and rank one or two; (2) the class of languages generated by ESLTAGs (ESLTAL) coincides with the class of languages generated by MCFGs with degree five or less; (3) ESLTAL properly

∗Doctoral Dissertation, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DD0561011, February 1, 2007.
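The distinction between nested and crossed base pairs drawn in the abstract can be illustrated with a small check. The following Python sketch is not taken from the thesis; it assumes a secondary structure is given as a list of index pairs (i, j), and the function names `crossing_pairs` and `is_nested` as well as the two example structures are hypothetical. A structure with no crossing pairs is the kind a CFG can describe, while any crossing indicates a pseudoknot.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every two base pairs (i, j), (k, l) that satisfy i < k < j < l.

    Each base pair is a tuple of sequence positions. Two pairs cross when
    exactly one endpoint of the second lies strictly between the endpoints
    of the first.
    """
    # Normalize so that each pair is (smaller, larger), then sort by start.
    normalized = sorted((min(i, j), max(i, j)) for i, j in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k, so only the i < k < j < l pattern can cross.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


def is_nested(pairs):
    """True when no two base pairs cross (a pseudoknot-free structure)."""
    return not crossing_pairs(pairs)


# Hypothetical example structures (positions are 0-based indices).
hairpin = [(0, 9), (1, 8), (2, 7)]             # fully nested stem
pseudoknot = [(0, 6), (1, 5), (3, 9), (4, 8)]  # two stems that interleave

print(is_nested(hairpin))
print(is_nested(pseudoknot))
print(crossing_pairs(pseudoknot))
```

The pairwise comparison is quadratic in the number of base pairs, which is adequate for illustration; it only detects crossings and says nothing about which grammar class is needed to generate a given pseudoknot.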
