DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences

Quantifying DNA properties is a challenging task in the broad field of human genomics. Since the vast majority of non-coding DNA is still poorly understood in terms of function, this task is particularly important to have enormous benefit for biology research. Various DNA sequences should have a great variety of representations, and specific functions may focus on corresponding features in the front part of learning model. Currently, however, for multi-class prediction of non-coding DNA regulatory functions, most powerful predictive models do not have appropriate feature extraction and selection approaches for specific functional effects, so that it is difficult to gain a better insight into their internal correlations. Hence, we design a category attention layer and category dense layer in order to select efficient features and distinguish different DNA functions. In this study, we propose a hybrid deep neural network method, called DeepATT, for identifying $919$ regulatory functions on nearly $5$ million DNA sequences. Our model has four built-in neural network constructions: convolution layer captures regulatory motifs, recurrent layer captures a regulatory grammar, category attention layer selects corresponding valid features for different functions and category dense layer classifies predictive labels with selected features of regulatory functions. Importantly, we compare our novel method, DeepATT, with existing outstanding prediction tools, DeepSEA and DanQ. DeepATT performs significantly better than other existing tools for identifying DNA functions, at least increasing $1.6\%$ area under precision recall. Furthermore, we can mine the important correlation among different DNA functions according to the category attention module. Moreover, our novel model can greatly reduce the number of parameters by the mechanism of attention and locally connected, on the basis of ensuring accuracy.

[1]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[2]  Matthew Slattery,et al.  Absence of a simple code: how transcription factors read the genome. , 2014, Trends in biochemical sciences.

[3]  Jürgen Schmidhuber,et al.  Multi-dimensional Recurrent Neural Networks , 2007, ICANN.

[4]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[5]  Yi Li,et al.  Gene expression inference with deep learning , 2015, bioRxiv.

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[8]  Feng Luo,et al.  DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. , 2019, Bioinformatics.

[9]  William Stafford Noble,et al.  Quantifying similarity between motifs , 2007, Genome Biology.

[10]  Guido Sanguinetti,et al.  Explorer Transcription factor binding predicts histone modifications in human cell lines , 2017 .

[11]  Xiaohui S. Xie,et al.  DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences , 2015, bioRxiv.

[12]  Mohamed Chaabane,et al.  Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities , 2019, Bioinform..

[13]  Michael Q. Zhang,et al.  Integrative analysis of 111 reference human epigenomes , 2015, Nature.

[14]  Xiaohui Xie,et al.  DANN: a deep learning approach for annotating the pathogenicity of genetic variants , 2015, Bioinform..

[15]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[16]  Mathieu Blanchette,et al.  Prediction of mRNA subcellular localization using deep recurrent neural networks , 2019, Bioinform..

[17]  Data production leads,et al.  An integrated encyclopedia of DNA elements in the human genome , 2012 .

[18]  B. Frey,et al.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning , 2015, Nature Biotechnology.

[19]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[20]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[21]  David J. Arenillas,et al.  JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles , 2015, Nucleic Acids Res..

[22]  Chao Ren,et al.  BiRen: predicting enhancers with a deep‐learning‐based model using the DNA sequence alone , 2017, Bioinform..

[23]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[24]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[25]  Wei Wang,et al.  Predicting the Human Epigenome from DNA Motifs , 2014, Nature Methods.