Recent Progresses in the Linguistic Modeling of Biological Sequences Based on Formal Language Theory

Treating genomes as languages raises the possibility of producing concise generalizations about the information in biological sequences. Grammars used in this way would constitute a model of the underlying biological processes or structures, and may in fact serve as an appropriate tool for theory formation. The increasing number of available biological sequences further highlights the growing need to develop grammatical systems in bioinformatics. The intent of this review is therefore to list bibliographic references on recent progress in the grammatical modeling of biological sequences. The review also includes sections that briefly introduce the basics of formal language theory, such as the Chomsky hierarchy, for non-experts in computational linguistics, and provides pointers for starting a deeper investigation into the field.
