Recurrent Neural Network for Predicting Transcription Factor Binding Sites

DNA sequences are known to contain many transcription factor (TF) binding sites, only some of which have been identified through biological experiments. Such experiments are expensive and time-consuming. To overcome these problems, computational methods based on k-mer features or convolutional neural networks have been proposed to identify TF binding sites from DNA sequences. Although these methods perform well, they still lack the context information related to TF binding sites. Research indicates that recurrent neural networks (RNNs) and their variants perform better on time-series data than other models. In this study, we propose a model, named KEGRU, that identifies TF binding sites by combining a bidirectional gated recurrent unit (GRU) network with k-mer embedding. First, DNA sequences are divided into k-mer sequences using a specified k-mer length and stride window. Second, we treat each k-mer as a word and pre-train a word representation model with the word2vec algorithm. Third, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results show that our method performs better than several state-of-the-art methods. Additional experiments on the embedding strategy show that k-mer embedding helps to improve model performance. The robustness of KEGRU is demonstrated by experiments with different k-mer lengths, stride windows, and embedding vector dimensions.
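The first step of the pipeline, dividing a DNA sequence into k-mers with a given length and stride window, can be illustrated with a minimal sketch. The function names `split_into_kmers` and `build_vocabulary`, the example sequence, and the choice to drop a trailing fragment shorter than k are assumptions made for illustration; they are not taken from the KEGRU implementation. The resulting k-mer "sentences" are the kind of input a word2vec model could be pre-trained on.

```python
from typing import Dict, Iterable, List


def split_into_kmers(sequence: str, k: int, stride: int) -> List[str]:
    """Slide a window of length k along the sequence, advancing by `stride`.

    Assumption: a trailing fragment shorter than k is dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(kmer_sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sentence in kmer_sentences:
        for kmer in sentence:
            if kmer not in vocab:
                vocab[kmer] = len(vocab)
    return vocab


# Hypothetical example sequence; real inputs would be much longer.
example_kmers = split_into_kmers("ACGTACGTAC", k=4, stride=2)
example_vocab = build_vocabulary([example_kmers])
example_ids = [example_vocab[kmer] for kmer in example_kmers]
```

Each k-mer plays the role of a word and each tokenized sequence the role of a sentence, so the integer ids in `example_ids` could index rows of a pre-trained embedding matrix before the sequence is passed to a recurrent model.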
