Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences

MOTIVATION: The field of 'DNA linguistics' has emerged from pioneering work in computational linguistics and molecular biology. Most formal grammars in this field are expressed as Definite Clause Grammars, but these have computational limitations that must be overcome. The present study provides a new DNA parsing system comprising a logic grammar formalism, Basic Gene Grammars (BGGs), and a bidirectional chart parser, DNA-ChartParser.

RESULTS: Basic Gene Grammars are shown to represent many formulations of knowledge about Escherichia coli promoters, including knowledge acquired from human experts, consensus sequences, statistics (weight matrices), symbolic learning, and neural network learning. The DNA-ChartParser provides bidirectional parsing of BGGs and handles overlapping categories, gap categories, approximate pattern matching, and constraints. Together, Basic Gene Grammars and the DNA-ChartParser allowed different sources of knowledge for recognizing E. coli promoters to be combined, which improved recognition accuracy when these DNA sequences were parsed in real-world data sets.
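To make two of the ideas above concrete, the following minimal sketch combines approximate pattern matching with a variable-length gap. It is not the BGG formalism or the DNA-ChartParser, neither of which is specified in this abstract. It assumes the widely cited sigma-70 consensus hexamers TTGACA (-35 box) and TATAAT (-10 box) and a 15 to 21 bp spacer; the function names, the `Hit` type, and the default thresholds are hypothetical choices made for illustration.

```python
from typing import Iterator, NamedTuple

# Assumed consensus hexamers for E. coli sigma-70 promoters (not taken from this abstract).
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


class Hit(NamedTuple):
    start35: int     # 0-based start of the -35 box
    start10: int     # 0-based start of the -10 box
    mismatches: int  # total mismatches over both boxes


def count_mismatches(window: str, pattern: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(window, pattern))


def find_promoters(seq: str, max_mismatch: int = 2,
                   gap_range: tuple = (15, 21)) -> Iterator[Hit]:
    """Yield every -35/-10 pair where each box has at most max_mismatch
    mismatches and the spacer length lies within gap_range (inclusive)."""
    seq = seq.upper()
    n35, n10 = len(MINUS_35), len(MINUS_10)
    for i in range(len(seq) - n35 + 1):
        m35 = count_mismatches(seq[i:i + n35], MINUS_35)
        if m35 > max_mismatch:
            continue
        # Try each allowed spacer length: the "gap category" of this sketch.
        for gap in range(gap_range[0], gap_range[1] + 1):
            j = i + n35 + gap
            if j + n10 > len(seq):
                break
            m10 = count_mismatches(seq[j:j + n10], MINUS_10)
            if m10 <= max_mismatch:
                yield Hit(i, j, m35 + m10)


if __name__ == "__main__":
    demo = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
    for hit in find_promoters(demo, max_mismatch=0):
        print(hit)
```

A grammar-based parser would express the same pattern declaratively, as a rule of the form "promoter is a -35 box, then a gap, then a -10 box", rather than as nested loops; the loops here only show what such a rule has to match.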
