DeePromoter: Robust Promoter Predictor Using Deep Learning

The promoter region lies near the transcription start site and regulates transcription initiation by controlling the binding of RNA polymerase, which makes promoter recognition an important problem in bioinformatics. Numerous promoter prediction tools have been proposed, but their reliability still needs to be improved. In this work, we propose a robust deep learning model, DeePromoter, that analyzes the characteristics of short eukaryotic promoter sequences and accurately recognizes human and mouse promoter sequences. DeePromoter combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network. Additionally, instead of using non-promoter regions of the genome as the negative set, we derive a more challenging negative set from the promoter sequences themselves. The proposed negative set reconstruction method improves the discrimination ability of the model and significantly reduces the number of false positive predictions. Consequently, DeePromoter outperforms previously proposed promoter prediction tools. In addition, a web server for promoter prediction has been developed based on the proposed methods and is available at https://home.jbnu.ac.kr/NSCL/deepromoter.htm.
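
The abstract does not spell out how the negative set is derived from the promoters, so the following is only a minimal sketch of one plausible scheme: split a promoter into equal-sized chunks, overwrite a random subset of those chunks with random nucleotides, and keep the rest intact. The function name `make_hard_negative` and the defaults `n_parts=20` and `n_replaced=12` are illustrative assumptions, not values taken from the text above.

```python
import random

NUCLEOTIDES = "ACGT"


def make_hard_negative(promoter, n_parts=20, n_replaced=12, rng=None):
    """Derive a negative sample from a promoter sequence.

    Hypothetical helper: the chunk counts are assumptions for illustration.
    The promoter is cut into n_parts near-equal chunks; n_replaced of them
    are overwritten with random nucleotides and the others are conserved.
    """
    rng = rng or random.Random()
    if len(promoter) < n_parts:
        raise ValueError("sequence is shorter than the number of chunks")
    if not 0 <= n_replaced <= n_parts:
        raise ValueError("n_replaced must lie between 0 and n_parts")

    # Chunk boundaries: bounds[0] == 0 and bounds[-1] == len(promoter).
    bounds = [i * len(promoter) // n_parts for i in range(n_parts + 1)]
    # Indices of the chunks that will be randomized.
    replaced = set(rng.sample(range(n_parts), n_replaced))

    pieces = []
    for i in range(n_parts):
        chunk = promoter[bounds[i]:bounds[i + 1]]
        if i in replaced:
            # Same length as the original chunk, random content.
            chunk = "".join(rng.choice(NUCLEOTIDES) for _ in chunk)
        pieces.append(chunk)
    return "".join(pieces)


if __name__ == "__main__":
    toy_promoter = "TATAAAGGCC" * 30  # toy 300-nt sequence, not real data
    print(make_hard_negative(toy_promoter, rng=random.Random(0)))
```

Because the conserved chunks still carry promoter-like content, a negative built this way resembles a true promoter far more closely than an arbitrary non-promoter region would, which is the sense in which such a negative set is more challenging for a classifier.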
