Deep learning approaches that have produced breakthrough predictive models in computer vision, speech recognition and machine translation are now being successfully applied to problems in regulatory genomics. However, the deep learning architectures used thus far in genomics are often ported directly from computer vision and natural language processing applications with few, if any, domain-specific modifications. In double-stranded DNA, complementary base pairing means that the same pattern can appear on one strand or, as its reverse complement, on the other. Here, we show that conventional deep learning models that do not explicitly model this property can produce substantially different predictions on the forward and reverse-complement versions of the same DNA sequence. We present four new convolutional neural network layers that leverage the reverse-complement property of genomic DNA sequence by sharing parameters between forward and reverse-complement representations in the model. These layers guarantee that forward and reverse-complement sequences produce identical predictions within numerical precision. Using experiments on simulated and in vivo transcription factor binding data, we show that our proposed architectures lead to improved performance, faster learning and cleaner internal representations compared to conventional architectures trained on the same data.

Availability: Our implementation is available at https://github.com/kundajelab/keras/tree/keras_1

Contact: avanti@stanford.edu, pgreens@stanford.edu, akundaje@stanford.edu
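The parameter sharing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' Keras implementation: the function names (`reverse_complement`, `rc_shared_filters`, `predict`), the A, C, G, T channel order, and the max-pool plus strand-summing output head are assumptions chosen for illustration. The idea shown is that each learned filter is paired with its reverse-complemented copy, so reverse-complementing the input simply reverses the output along both the position and channel axes.

```python
import numpy as np

def reverse_complement(x):
    """x: one-hot DNA of shape (length, 4), channels assumed ordered A, C, G, T.
    Reversing positions and channels swaps A<->T and C<->G on the opposite strand."""
    return x[::-1, ::-1]

def conv1d_valid(x, w):
    """Plain 'valid' cross-correlation. x: (length, 4), w: (width, 4, n_filters)."""
    windows = np.lib.stride_tricks.sliding_window_view(x, w.shape[0], axis=0)
    # windows has shape (length - width + 1, 4, width)
    return np.einsum("ick,kcf->if", windows, w)

def rc_shared_filters(w):
    """Append a reverse-complemented copy of every filter, in reversed filter order,
    so that output channel f and channel (2 * n_filters - 1 - f) share parameters."""
    return np.concatenate([w, w[::-1, ::-1, ::-1]], axis=2)

def predict(x, w, v):
    """Hypothetical head: ReLU, max over positions, sum each strand pair, linear output."""
    act = np.maximum(conv1d_valid(x, rc_shared_filters(w)), 0.0)
    pooled = act.max(axis=0)                        # shape (2 * n_filters,)
    n = w.shape[2]
    strand_merged = pooled[:n] + pooled[::-1][:n]   # pair channel f with its RC twin
    return float(strand_merged @ v)

rng = np.random.default_rng(0)
x = np.eye(4)[rng.integers(0, 4, size=50)]   # random one-hot sequence, length 50
w = rng.normal(size=(8, 4, 3))               # 3 filters of width 8
v = rng.normal(size=3)                       # output weights

# Compare the prediction on the sequence with that on its reverse complement.
print(np.isclose(predict(x, w, v), predict(reverse_complement(x), w, v)))
```

Because the two strand channels of each filter are summed before the output weights are applied, the prediction does not depend on which strand is presented, up to floating-point error. An ordinary convolution with unconstrained filters has no such symmetry, which is the source of the forward versus reverse-complement discrepancy the abstract describes.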