A survey on deep learning in DNA/RNA motif mining

DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- and RNA-protein binding sites, which helps in understanding the mechanisms of gene regulation and management. Over the past few decades, researchers have worked on designing new, efficient, and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: enumeration approaches and probabilistic methods. In recent years, machine learning methods have made great progress, and deep learning in particular has achieved good performance. Existing deep learning methods for motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN-RNN based models. We introduce the application of deep learning to motif mining in terms of data preprocessing, the features of existing deep learning architectures, and the differences between the basic deep learning models. Through analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), and computer games. Therefore, a summary of deep learning in motif mining is necessary to help researchers understand this field.
