Eukaryotic and prokaryotic promoter prediction using hybrid approach

Promoters are modular DNA structures containing the complex regulatory elements required for initiating gene transcription. Identifying promoters with machine learning approaches is therefore important for improving genome annotation and understanding transcriptional regulation. In recent years, many methods have been proposed for predicting eukaryotic and prokaryotic promoters, but their performance is still far from satisfactory. In this article, we develop a hybrid approach (called IPMD) that combines a position correlation score function and the increment of diversity with a modified Mahalanobis discriminant to predict eukaryotic and prokaryotic promoters. Applying the proposed method to Drosophila melanogaster, Homo sapiens, Caenorhabditis elegans, Escherichia coli, and Bacillus subtilis promoter sequences, we achieve sensitivities and specificities of 90.6% and 97.4% for D. melanogaster, 88.1% and 94.1% for H. sapiens, 83.3% and 95.2% for C. elegans, 84.9% and 91.4% for E. coli, and 80.4% and 91.3% for B. subtilis. These high accuracies indicate that IPMD is an efficient method for identifying eukaryotic and prokaryotic promoters. The approach can also be extended to predict promoters in other species.
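As a rough illustration of one ingredient named above, the increment of diversity can be sketched from the commonly used definition of the measure of diversity, D(X) = N ln N - sum_i n_i ln n_i, with ID(X, Y) = D(X + Y) - D(X) - D(Y). The k-mer features, the function names, and the toy sequences below are assumptions made for illustration only; the abstract does not specify which features IPMD actually uses.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seqs, k=2):
    """Count overlapping k-mers over a list of DNA sequences.

    Returns a count vector in a fixed ACGT k-mer order (hypothetical
    feature choice). K-mers containing other symbols are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return [counts[m] for m in kmers]


def diversity(counts):
    """Measure of diversity: N ln N - sum(n_i ln n_i), with 0 ln 0 = 0."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return n * math.log(n) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller means more similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


# Toy usage: compare a query sequence with two made-up reference sets.
promoter_like = kmer_counts(["TATAAAGGCG", "TATATAAGCC"])
background = kmer_counts(["GCGCGGCCGC", "CCGGCGCGGC"])
query = kmer_counts(["TTATAAAGCG"])

id_promoter = increment_of_diversity(query, promoter_like)
id_background = increment_of_diversity(query, background)
print(id_promoter, id_background)
```

In a setup like this, a query would be considered closer to whichever reference set yields the smaller increment of diversity; in the hybrid approach such values would serve as inputs to the discriminant rather than as a final decision.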
