Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences

MOTIVATION: Although a great deal of research has been undertaken in the area of promoter prediction, prediction techniques are still not fully developed. Many algorithms exhibit either poor specificity, generating many false positives, or poor sensitivity. The neural network prediction program NNPP2.2 is one such example.

RESULTS: To improve the NNPP2.2 prediction technique, the distance between the transcription start site (TSS) associated with the promoter and the translation start site (TLS) of the subsequent gene coding region was studied for Escherichia coli K12. An empirical probability distribution of this distance that is consistent for all E. coli promoters was established. This information is combined with the results from NNPP2.2 to create a new technique called TLS-NNPP, which improves the specificity of promoter prediction. The technique is shown to be effective using E. coli DNA sequences; however, it is applicable to any organism for which a set of promoters has been experimentally defined.

AVAILABILITY: The data used in this project and the prediction results for the tested sequences can be obtained from http://www.uow.edu.au/~yanxia/E_Coli_paper/SBurden_Results.xls

CONTACT: alh98@uow.edu.au
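The idea of weighting a promoter score by the TSS-to-TLS distance can be illustrated with a short sketch. The abstract does not state how TLS-NNPP combines the two sources of information, so the multiplicative rule below is an assumption made for illustration only, as are the function names, the 50 bp bin width, and the example distances, none of which come from the paper.

```python
from collections import Counter


def empirical_distance_pmf(distances, bin_width=50):
    """Estimate the probability of each distance bin from known
    TSS-to-TLS distances (in base pairs).

    `distances` would come from experimentally defined promoters;
    the binning scheme here is a hypothetical choice.
    """
    counts = Counter(d // bin_width for d in distances)
    total = len(distances)
    return {bin_index: n / total for bin_index, n in counts.items()}


def combined_score(nnpp_score, tss_pos, tls_pos, pmf, bin_width=50):
    """Weight a neural network score by the empirical probability of
    the observed TSS-to-TLS distance.

    Assumption: the two quantities are simply multiplied. A predicted
    TSS downstream of the TLS, or at a distance never seen in the
    training data, receives a score of 0.
    """
    distance = tls_pos - tss_pos
    if distance < 0:
        return 0.0
    return nnpp_score * pmf.get(distance // bin_width, 0.0)


# Hypothetical distances, invented for this example.
known_distances = [20, 30, 35, 40, 45, 60, 120, 260]
pmf = empirical_distance_pmf(known_distances)

# Two candidate TSS predictions with the same raw score of 0.8.
near = combined_score(0.8, tss_pos=100, tls_pos=140, pmf=pmf)
far = combined_score(0.8, tss_pos=100, tls_pos=500, pmf=pmf)
```

In this sketch a candidate whose distance falls in a well-populated bin keeps most of its score, while one at an unobserved distance is suppressed, which is one way a distance distribution could reduce false positives.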
