Learning to Recognize Promoter Sequences in E. coli by Modeling Uncertainty in the Training Data

Automatic recognition of promoter sequences is an important open problem in molecular biology. Unfortunately, the usual machine learning version of this problem is critically flawed. In particular, the dataset available from the Irvine repository was drawn from a compilation of promoter sequences that had been preprocessed to conform to the biologists' related notion of the consensus sequence, a first-order approximation whose shortcomings are well known in molecular biology. Concept descriptions learned from the Irvine data may therefore represent the consensus sequence, but they do not represent promoters. More generally, imperfections in preprocessed data and statistical variation in the locations of biologically meaningful features within the raw data invalidate standard attribute-based approaches. I propose a dataset, a concept-description language, and a model of uncertainty in the promoter data that are all biologically justified, and then address the learning problem with incremental probabilistic evidence combination. This knowledge-based approach yields a more accurate and more credible solution than more conventional machine learning systems.
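The point about positional variation can be illustrated with a small sketch. It is not the method of this paper: the hexamers TTGACA and TATAAT and the 15 to 21 base spacer range are textbook values for E. coli sigma-70 promoters, assumed here for illustration, and the function names are hypothetical. The sketch contrasts an attribute-style check, which reads features at fixed columns, with a scan that lets the features slide.

```python
# Illustrative sketch only. The consensus hexamers and spacer range are
# textbook values for E. coli sigma-70 promoters, not taken from this paper.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fixed_position_score(seq, pos35, spacer=17):
    """Attribute-style check: read both hexamers at fixed columns."""
    pos10 = pos35 + 6 + spacer
    return (mismatches(seq[pos35:pos35 + 6], MINUS_35)
            + mismatches(seq[pos10:pos10 + 6], MINUS_10))


def best_alignment(seq, spacers=range(15, 22)):
    """Try every start position and spacer length.

    Returns (total_mismatches, pos35, spacer) for the best placement,
    or None if the sequence is too short for any placement.
    """
    best = None
    for spacer in spacers:
        # Both hexamers plus the spacer must fit inside the sequence.
        for pos35 in range(len(seq) - (12 + spacer) + 1):
            candidate = (fixed_position_score(seq, pos35, spacer), pos35, spacer)
            if best is None or candidate < best:
                best = candidate
    return best
```

On a sequence whose two hexamers are perfect but separated by 18 bases rather than 17, the fixed-column check reports several mismatches in the second hexamer, while the scan recovers a mismatch-free placement. This is the sense in which a one-base shift can defeat an attribute-per-column encoding.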
