Promoter prediction using DNA numerical representation and neural network: Case study with three organisms

Promoter recognition across organisms is an active area of interest in bioinformatics. This paper presents a feed-forward neural network classifier that predicts promoters in three organisms from a numerical representation of DNA. The proposed system predicts promoters with sensitivities of 87%, 87%, and 99% while limiting false predictions on non-promoter sequences, with specificities of 92%, 94%, and 99%, for the human, Drosophila melanogaster, and Arabidopsis thaliana sequences, respectively. The results show that feed-forward neural networks can extract the statistical characteristics of promoters efficiently, that 2-bit binary coding of DNA data is suitable for the Berkeley human and Drosophila datasets, and that 4-bit binary coding is suitable for the TAIR Arabidopsis thaliana datasets. The results also show that the proposed prediction system is reconfigurable and versatile, with a reduced architecture and lower computational complexity.
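To make the 2-bit and 4-bit binary codings and the reported metrics concrete, the following is a minimal Python sketch. The abstract does not state which bit pattern is assigned to each nucleotide, so the tables `TWO_BIT` and `FOUR_BIT` (the latter a one-hot code) are assumptions chosen for illustration, as are the function names `encode`, `sensitivity`, and `specificity`. The metric functions use the standard definitions TP/(TP+FN) and TN/(TN+FP).

```python
from typing import Dict, List

# Hypothetical nucleotide-to-bit mappings; the paper's actual assignment
# is not given in the abstract.
TWO_BIT: Dict[str, List[int]] = {
    "A": [0, 0], "C": [0, 1], "G": [1, 0], "T": [1, 1],
}
# Assumed one-hot form of a 4-bit code (one bit per nucleotide).
FOUR_BIT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0], "T": [0, 0, 0, 1],
}


def encode(seq: str, table: Dict[str, List[int]]) -> List[int]:
    """Concatenate the bit code of every base into one flat input vector."""
    bits: List[int] = []
    for base in seq.upper():
        if base not in table:
            raise ValueError(f"unsupported base: {base!r}")
        bits.extend(table[base])
    return bits


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true promoters that are predicted as promoters."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-promoters that are predicted as non-promoters."""
    return tn / (tn + fp)


if __name__ == "__main__":
    print(encode("ACGT", TWO_BIT))
    print(len(encode("ACGT", FOUR_BIT)))
```

Under these assumed tables, a sequence of n bases yields a network input of 2n values with the 2-bit code and 4n values with the 4-bit code, so the choice of coding directly sets the size of the input layer.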
