Analysis of E.coli promoter recognition problem in dinucleotide feature space

MOTIVATION: Promoter sequence patterns are known to be conserved within a species, but the many exceptions to this rule make promoter recognition a complex problem. Although the literature proposes many elaborate feature extraction schemes coupled with a variety of classifiers, the problem remains open.

RESULTS: This article proposes a global dinucleotide feature extraction method for recognizing sigma-70 promoters in Escherichia coli. The positive data set consists of sigma-70 promoters with known transcription start sites, drawn from the RegulonDB and PromEC databases. Four negative data sets are considered: two biological sets (Gordon et al., 2003) and two synthetic sets. Our results show that a single-layer perceptron using dinucleotide features achieves an accuracy of 80% against a background of biological non-promoters and 96% against random data sets. A scheme for locating promoter regions in a given genome sequence is also proposed. A deeper analysis shows that the data set bifurcates into two distinct classes, a majority class and a minority class. The majority class, comprising the majority promoter and majority non-promoter signals, is linearly separable, as is the minority class. We further show that the proposed feature extraction and classification methods are generic enough to apply to the more complex problem of eukaryotic promoter recognition, and we present Drosophila promoter recognition as a case study.

AVAILABILITY: http://202.41.85.117/htmfiles/faculty/tsr/tsr.html
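The pipeline summarized in the abstract (dinucleotide features, a single-layer perceptron, and a scan over a longer sequence) can be illustrated with a minimal Python sketch. The abstract does not define the features precisely, so the sketch assumes they are the relative frequencies of the 16 overlapping dinucleotides over the whole sequence. The function names, the perceptron learning rate and epoch count, and the window and step sizes are all hypothetical choices made for illustration, and the toy sequences are not real promoters.

```python
import itertools

# The 16 possible dinucleotides over the DNA alphabet, in a fixed order.
DINUCLEOTIDES = ["".join(p) for p in itertools.product("ACGT", repeat=2)]
INDEX = {d: i for i, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_features(seq):
    """Relative frequency of each overlapping dinucleotide (assumed encoding)."""
    seq = seq.upper()
    counts = [0] * len(DINUCLEOTIDES)
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:  # skip pairs containing ambiguous bases such as N
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def activation(w, b, x):
    """Linear score w.x + b of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Classic perceptron rule; labels are +1 (promoter) or -1 (non-promoter)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            if y * activation(w, b, x) <= 0:  # misclassified: move the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every training sample is on the correct side
            break
    return w, b


def predict(w, b, x):
    """Return +1 for promoter-like, -1 otherwise."""
    return 1 if activation(w, b, x) > 0 else -1


def scan(genome, w, b, window=60, step=60):
    """Start offsets of windows classified as promoter-like (hypothetical scheme)."""
    return [
        start
        for start in range(0, len(genome) - window + 1, step)
        if predict(w, b, dinucleotide_features(genome[start:start + window])) == 1
    ]


# Toy data only: an AT-rich stand-in for a promoter and a GC-rich stand-in
# for a non-promoter. Real training would use curated sequence sets.
toy_promoter = "TATAAT" * 10
toy_background = "GCGGCC" * 10
X = [dinucleotide_features(toy_promoter), dinucleotide_features(toy_background)]
y = [1, -1]
weights, bias = train_perceptron(X, y)
toy_genome = toy_background * 2 + toy_promoter * 2 + toy_background * 2
hits = scan(toy_genome, weights, bias)
```

Because the perceptron rule converges only on linearly separable data, running it on real promoter and non-promoter sets is also a simple way to probe the separability claim made in the abstract: if the error count reaches zero within the epoch budget, the training set is linearly separable in this feature space.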

[1]  C. Harley, et al. Analysis of E. coli promoter sequences, 1987, Nucleic Acids Research.

[2]  A. A. Reilly, et al. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, 1990, Proteins.

[3]  I. Mahadevan, et al. Analysis of E. coli promoter structures using neural networks, 1994, Nucleic Acids Research.

[4]  S. Aiyar, et al. Escherichia coli Promoters with UP Elements of Different Strengths: Modular Structure of Bacterial Promoters, 1998, Journal of Bacteriology.

[5]  T. Werner. Models for prediction and recognition of eukaryotic promoters, 1999, Mammalian Genome.

[6]  Hanah Margalit, et al. PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites, 2001, Nucleic Acids Res.

[7]  Dennis Shasha, et al. DNA sequence classification via an expectation maximization algorithm and neural networks: a case study, 2001, IEEE Trans. Syst. Man Cybern. Part C.

[8]  G. Rubin, et al. Computational analysis of core promoters in the Drosophila genome, 2002, Genome Biology.

[9]  Julio Collado-Vides, et al. Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping promoter-like signals, 2003, Journal of Molecular Biology.

[10]  S. Durga Bhavani, et al. Identification of Promoter Region in a DNA Sequence Using EM Algorithm and Neural Networks, 2003, IICAI.

[11]  Alexander Gammerman, et al. Sequence alignment kernel for recognition of promoter regions, 2003, Bioinform.

[12]  G. Stormo, et al. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments, 1992, Journal of Molecular Biology.

[13]  Vasile Palade, et al. A neural network based multi-classifier system for gene identification in DNA sequences, 2004, Neural Computing & Applications.

[14]  Cheng-Jian Lin, et al. Prediction of RNA Polymerase Binding Sites Using Purine-Pyrimidine Encoding and Hybrid Learning Methods, 2004.

[15]  Kiyoshi Asai, et al. Extracting relations between promoter sequences and their strengths from microarray data, 2005, Bioinform.

[16]  Armin Shmilovici, et al. Identification of transcription factor binding sites with variable-order Bayesian networks, 2005, Bioinform.