RNA modeling using stochastic context-free grammars

Recent developments in high-throughput biological technologies have created a wealth of biological sequence data. The immense size of these biological datasets has prompted the use of computational methods for their analysis. This work presents the theory and application of stochastic context-free grammars (SCFGs) to biological sequence analysis and specifically to the problem of RNA secondary structure modeling. SCFGs characterize biological sequences in a way that takes into account the statistical identity of different sequence positions, including pairwise interactions between positions. It is their ability to model pairwise interacting positions that makes SCFGs a natural mathematical model of RNA secondary structure. SCFGs can automatically generate structural multiple alignments of RNA families that take into account basepairing interactions. SCFGs are presented as an extension of another probabilistic model used in biological sequence analysis, hidden Markov models. I present several SCFG algorithm developments, including an SCFG constraint system that gives significant performance enhancements in both time and space and allows large SCFGs to be applied to large sequence analysis problems. I give a method using intersected SCFGs to model non-context-free structures. I also introduce a new method of sequence classification using a support vector machine framework and feature vectors generated from an SCFG. I apply the SCFG method to an in vitro selected RNA pseudoknot that binds biotin. Even though SCFGs cannot model the RNA pseudoknot structure directly, I show that an approximation using two SCFGs can effectively perform database searches and find RNA pseudoknot structures. I then apply SCFGs to modeling small subunit ribosomal RNA, a large molecule that is important to the construction of phylogenetic trees of life. I compare the SCFG method to several other methods in constructing multiple alignments of this molecule and show that the SCFG outperforms the other methods, attaining a multiple alignment whose quality is close to that of hand-edited alignments. I apply SCFGs with support vector machines to a phylogenetic classification problem and show that they outperform a standard method. I describe the SCFG RNA modeling software, RNACAD, that was used in this work.
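
To illustrate why pairwise production rules make SCFGs a natural fit for RNA secondary structure, the following is a minimal sketch of the inside algorithm for a toy grammar with three rule types: S -> x S y (emit a base pair), S -> x S (emit an unpaired base), and S -> end. The grammar, the probability values, and the function name `inside_probability` are assumptions chosen for illustration; they are not the grammars, parameters, or code used in this work or in RNACAD.

```python
from functools import lru_cache

# Hypothetical rule probabilities for the toy grammar (they sum to 1).
P_PAIR, P_SINGLE, P_END = 0.4, 0.4, 0.2

# Hypothetical emission tables: only Watson-Crick pairs may be emitted jointly.
PAIR_EMIT = {("a", "u"): 0.25, ("u", "a"): 0.25,
             ("c", "g"): 0.25, ("g", "c"): 0.25}
SINGLE_EMIT = {base: 0.25 for base in "acgu"}


def inside_probability(seq):
    """Total probability that the toy grammar generates seq (inside algorithm).

    Uses memoized recursion, so it is only suitable for short sequences.
    """
    seq = seq.lower()

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Probability that S derives the subsequence seq[i:j].
        if i == j:
            return P_END  # S -> end generates the empty string
        # S -> x S : emit seq[i] as an unpaired base.
        total = P_SINGLE * SINGLE_EMIT.get(seq[i], 0.0) * inside(i + 1, j)
        # S -> x S y : emit seq[i] and seq[j-1] together as a base pair.
        if j - i >= 2:
            pair = (seq[i], seq[j - 1])
            total += P_PAIR * PAIR_EMIT.get(pair, 0.0) * inside(i + 1, j - 1)
        return total

    return inside(0, len(seq))


if __name__ == "__main__":
    for s in ("gggaaaccc", "gggaaaggg"):
        print(s, inside_probability(s))
```

Under these assumed parameters, a sequence whose ends can basepair (such as a hairpin) receives a higher probability than a sequence of the same length that cannot, because the pair rule emits two correlated positions in a single step. A hidden Markov model, which emits one position at a time, has no rule of this kind.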