Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks

A promoter is a region of DNA, typically located proximal to the transcription start site (TSS), that defines where transcription of a gene by RNA polymerase initiates. Correctly identifying the TSS and the core promoter of a gene is essential for understanding transcriptional regulation. As a complement to conventional experimental methods, computational tools with easy-to-use platforms can be effectively applied to annotate the functions and physiological roles of promoters. In this work, we propose Depicter (Deep learning for predicting promoter), a deep learning-based method for identifying three specific types of promoters: promoter sequences with the TATA-box (TATA model), promoter sequences without the TATA-box (non-TATA model), and indistinguishable promoters (TATA and non-TATA model). Depicter is developed on an up-to-date, species-specific dataset that includes Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana promoters. The prediction model of Depicter is a convolutional neural network coupled with capsule layers. Extensive benchmarking and independent tests demonstrate that Depicter achieves improved predictive performance compared with several state-of-the-art methods. The Depicter webserver is freely accessible at https://depicter.erc.monash.edu/.
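
The abstract does not give architectural details, so the following is only a minimal sketch of two building blocks that a sequence-based capsule network commonly relies on, not Depicter's actual implementation. It assumes one-hot encoding of the DNA input and the standard capsule "squash" nonlinearity; the names `one_hot`, `squash` and `capsule_length` are hypothetical and chosen for illustration.

```python
import math

# Assumed nucleotide alphabet; the abstract does not state the encoding used.
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Bases outside the alphabet (e.g. N) become an all-zero row.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def squash(vector, eps=1e-9):
    """Standard capsule squash: keep the direction, map the length into [0, 1).

    squash(v) = (|v|^2 / (1 + |v|^2)) * (v / |v|)
    """
    norm_sq = sum(x * x for x in vector)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)  # eps avoids 0/0
    return [scale * x for x in vector]


def capsule_length(vector):
    """Length of the squashed vector, often read as a class probability."""
    return math.sqrt(sum(x * x for x in squash(vector)))


if __name__ == "__main__":
    print(one_hot("TATA"))
    print(capsule_length([3.0, 4.0]))
```

In a capsule network, the length of each output capsule after squashing is usually interpreted as the confidence that the corresponding class (here, promoter versus non-promoter) is present.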
