Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

With advances in sequencing technology, a vast amount of genomic sequence information has become available. However, annotating the biological functions of genome sequences, particularly of non-protein-coding regions, without experiments remains a challenging task. Recently, deep learning–based methods have been shown to predict gene regulatory regions from genome sequences, promising to aid the interpretation of genomic sequence data. Here, we report improved prediction accuracy for gene regulatory regions, achieved by designing convolution layers that efficiently process genomic sequence information, and we present DeepGMAP, software for training and comparing different deep learning–based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, integrate both forward- and reverse-strand information and enhance the power to predict gene regulatory regions. Second, we assess previous studies and identify problems associated with data structures that caused overfitting. Finally, we introduce visualization methods to examine what the program learned. Together, these results show that our FRSS layers improve the prediction accuracy for gene regulatory regions.
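The idea of scanning both strands with the same convolution filters can be illustrated with a minimal NumPy sketch. This is not the DeepGMAP implementation: the function names (`one_hot`, `reverse_complement`, `scan`, `forward_reverse_scan`) are hypothetical, and combining the two strands with an element-wise maximum is an assumption made for illustration, since the abstract does not state how FRSS layers integrate the strands.

```python
import numpy as np

BASES = "ACGT"  # column order of the one-hot encoding (an assumption of this sketch)

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases such as N stay all-zero."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def reverse_complement(x):
    """Flip positions and channels; with ACGT ordering this swaps A<->T and C<->G."""
    return x[::-1, ::-1]

def scan(x, kernel):
    """Slide a (k, 4) kernel along a (L, 4) sequence and return L - k + 1 scores."""
    k = kernel.shape[0]
    n = x.shape[0] - k + 1
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(n)], dtype=np.float32)

def forward_reverse_scan(x, kernel):
    """Score both strands with one shared kernel and keep the stronger response."""
    fwd = scan(x, kernel)
    # Scan the reverse complement, then reverse the scores so that index i
    # refers to the same genomic window as fwd[i].
    rev = scan(reverse_complement(x), kernel)[::-1]
    # Assumption: strands are combined by element-wise maximum.
    return np.maximum(fwd, rev)

# A kernel that matches the motif ACG also responds to CGT, its reverse complement.
kernel = one_hot("ACG")
scores = forward_reverse_scan(one_hot("TTCGTAA"), kernel)
```

Because the kernel is shared across strands, the output in this sketch is the same (up to reversal) whether the sequence or its reverse complement is given as input, so a motif does not have to be learned twice.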
