Predicting E. coli promoters using formal grammars

Ever since the structure of DNA was discovered, linguistics has been part of molecular biology [13]. Grammatical linguistics is a powerful method for expressing information and describing its structure, and it can be used to express the information encoded in DNA. Most formal grammar applications to DNA are based on Searls' DNA parsing approach, which uses Prolog-based Definite Clause Grammars (DCG) [11]. Extensions of this approach include String Variable Grammar [6] and Basic Gene Grammars [5]. This paper presents a novel approach: parsing Escherichia coli (E. coli) promoter sequences using a Context-Free Grammar (CFG). The approach takes advantage of an error-correcting parsing algorithm introduced by Rajasekaran and Nicolae [1]. The idea is to derive a grammar for known promoter regions and then modify this grammar to tolerate errors. The resulting cover grammar can then be employed to recognize promoter regions. The cover grammar can be extended further by attaching probabilities to its production rules. This paper only introduces the paradigm; implementing the approach and testing it on various datasets is left to future work.
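As a rough illustration of the cover-grammar idea, the following Python sketch computes the minimum number of nucleotide substitutions needed for a sequence to be derivable from a small grammar. It rests on several assumptions that are not taken from the paper: the toy grammar (the textbook -35 and -10 consensus hexamers TTGACA and TATAAT around a spacer of arbitrary length), the names `GRAMMAR` and `min_substitutions`, and the restriction to substitution errors only. It uses plain cubic-time dynamic programming and is not the sub-cubic algorithm of [1].

```python
from functools import lru_cache

# Hypothetical toy grammar (an assumption for illustration, not the paper's).
# Keys are nonterminals; lowercase a/c/g/t are terminals.
# No epsilon rules and no unit-rule cycles, which the parser below relies on.
GRAMMAR = {
    "S": [("BOX35", "SPACER", "BOX10")],
    "BOX35": [tuple("ttgaca")],           # -35 consensus hexamer
    "BOX10": [tuple("tataat")],           # -10 consensus hexamer
    "SPACER": [("N", "SPACER"), ("N",)],  # one or more arbitrary bases
    "N": [("a",), ("c",), ("g",), ("t",)],
}


def min_substitutions(grammar, start, seq):
    """Fewest single-base substitutions that make `seq` derivable from `start`.

    Returns float('inf') if no number of substitutions suffices
    (insertions and deletions are deliberately not modelled here).
    """
    inf = float("inf")

    @lru_cache(maxsize=None)
    def cost_symbol(sym, i, j):
        # Cost of deriving seq[i:j] from a single grammar symbol.
        if sym not in grammar:  # terminal: must cover exactly one base
            if j - i != 1:
                return inf
            return 0 if seq[i] == sym else 1  # mismatch = one substitution
        return min((cost_rhs(rhs, i, j) for rhs in grammar[sym]), default=inf)

    @lru_cache(maxsize=None)
    def cost_rhs(rhs, i, j):
        # Cost of deriving seq[i:j] from a sequence of symbols.
        if len(rhs) == 1:
            return cost_symbol(rhs[0], i, j)
        best = inf
        # The first symbol covers seq[i:k]; every remaining symbol needs
        # at least one base, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = cost_symbol(rhs[0], i, k)
            if left < best:
                best = min(best, left + cost_rhs(rhs[1:], k, j))
        return best

    return cost_symbol(start, 0, len(seq))


if __name__ == "__main__":
    spacer = "acgtacgtacgtacgta"
    exact = "ttgaca" + spacer + "tataat"
    noisy = "tttaca" + spacer + "tatgat"  # one mismatch in each box
    print(min_substitutions(GRAMMAR, "S", exact))
    print(min_substitutions(GRAMMAR, "S", noisy))
```

A recognizer in this spirit would accept a candidate region when the returned cost falls below some threshold; the choice of threshold, like the grammar itself, is left open here. Attaching weights to productions instead of unit costs would be one way to move toward the probabilistic extension mentioned above.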

[1] Sanguthevar Rajasekaran, et al. An Error Correcting Parser for Context Free Grammars that Takes Less Than Cubic Time, 2014, LATA.

[2] Sanguthevar Rajasekaran, et al. Framework for Data Mining of Big Data Using Probabilistic Grammars, 2015, 2015 Fifth International Conference on e-Learning (econf).

[3] Subhasis Mukhopadhyay, et al. A Composite Method Based on Formal Grammar and DNA Structural Features in Detecting Human Polymerase II Promoter Region, 2013, PloS one.

[4] Wolfgang Raible. The grammar of genes: How the genetic code resembles the linguistic code (review), 2008.

[5] Angel Lopez-Garcia, et al. The Grammar of Genes: How the Genetic Code Resembles the Linguistic Code, 2005.

[6] S. Busby, et al. Identification and analysis of 'extended -10' promoters in Escherichia coli, 2003, Nucleic acids research.

[7] Chris Mellish, et al. Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences, 2001, Bioinform.

[8] David B. Searls, et al. String Variable Grammar: A Logic Grammar Formalism for the Biological Language of DNA, 1995, J. Log. Program.

[9] H. Margalit, et al. Compilation of E. coli mRNA promoter sequences, 1993, Nucleic acids research.

[10] David B. Searls, et al. The computational linguistics of biological sequences, 1993, ISMB 1995.

[11] David B. Searls, et al. The Linguistics of DNA, 1992.

[12] David B. Searls. Representing Genetic Information with Formal Grammars, 1988, AAAI.

[13] C. Harley, et al. Analysis of E. coli promoter sequences, 1987, Nucleic acids research.

[14] Arto Salomaa, et al. Formal languages, 1973, Computer science classics.

[15] Peter C. Chapin. Formal languages I, 1973, CSC '73.