Product Grammars for Alignment and Folding

We develop a theory of algebraic operations over linear and context-free grammars that makes it possible to combine simple “atomic” grammars operating on single sequences into complex, multi-dimensional grammars. We demonstrate the utility of this framework by constructing the search spaces of complex alignment problems on multiple input sequences explicitly as algebraic expressions of very simple one-dimensional grammars. In particular, we provide a fully worked frameshift-aware, semiglobal DNA-protein alignment algorithm whose grammar is composed of products of small, atomic grammars. The compiler accompanying our theory makes it easy to experiment with combinations of multiple grammars and different operations. Composite grammars can be written out in LaTeX for documentation and as a guide to implementing dynamic programming algorithms. An embedding in Haskell as a domain-specific language makes it possible to write and use grammar products directly, without the detour of an external compiler. Software and supplemental files are available at http://www.bioinf.uni-leipzig.de/Software/gramprod/.
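The idea of building a two-sequence alignment grammar as a product of one-dimensional grammars can be illustrated with a small sketch. The Python code below is not the authors' compiler or their Haskell DSL. The rule encoding, the `STEP` grammar, and the functions `direct_product` and `show` are hypothetical names introduced here, and the decision to drop the all-gap column is an assumption made for this illustration.

```python
from itertools import product

# Hypothetical rule encoding (not the authors' format):
# a rule is (lhs, rhs), where rhs is a tuple of symbols.
# "S" is the only nonterminal, "c" reads one character, "-" is a gap.
NONTERMINALS = {"S"}

STEP = [
    ("S", ("S", "c")),  # extend the parse by one character
    ("S", ("S", "-")),  # extend the parse by a gap, reading nothing
    ("S", ()),          # empty parse
]

def direct_product(g1, g2):
    """Pair rules with equally long right-hand sides, symbol by symbol."""
    rules = []
    for (lhs1, rhs1), (lhs2, rhs2) in product(g1, g2):
        if len(rhs1) != len(rhs2):
            continue
        # Only pair nonterminals with nonterminals, terminals with terminals.
        if any((a in NONTERMINALS) != (b in NONTERMINALS)
               for a, b in zip(rhs1, rhs2)):
            continue
        rhs = tuple(zip(rhs1, rhs2))
        # Assumption: drop the all-gap column, which consumes no input.
        if ("-", "-") in rhs:
            continue
        rules.append(((lhs1, lhs2), rhs))
    return rules

def show(rule):
    """Render a product rule in a readable one-line form."""
    (l1, l2), rhs = rule
    body = " ".join(f"({a},{b})" for a, b in rhs) or "(ε,ε)"
    return f"({l1},{l2}) -> {body}"

ALIGN = direct_product(STEP, STEP)
for r in ALIGN:
    print(show(r))
```

Under these assumptions the product of the step grammar with itself has four rules: an aligned pair of characters, a character against a gap in either direction, and the empty alignment. These correspond to the familiar cases of a global pairwise alignment recursion.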
