Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models.

Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

Availability and implementation: Code is available at: https://github.com/kundajelab/dfim.

Supplementary information: Supplementary data are available at Bioinformatics online.
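To make the idea of a pairwise feature interaction map concrete, the following is a minimal sketch and not the DFIM implementation. It assumes a brute-force in silico mutagenesis attribution rather than an efficient attribution method, and it scores an interaction as the change in a target position's importance when a source position is mutated. The model `toy_model`, the motif strings, their positions, and all function names are hypothetical placeholders chosen for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

# Hypothetical motifs and positions; not taken from any real model.
MOTIF_A, START_A = "GATA", 2
MOTIF_B, START_B = "CTG", 10

def match(x, motif, start):
    """Fraction of motif positions matched by x in a fixed window."""
    window = x[start:start + len(motif)]
    return float((window * one_hot(motif)).sum()) / len(motif)

def toy_model(x):
    """Stand-in for a trained DNN: additive motif terms plus a
    multiplicative (synergistic) term between the two motifs."""
    a = match(x, MOTIF_A, START_A)
    b = match(x, MOTIF_B, START_B)
    return a + b + 2.0 * a * b

def mutants(x, pos):
    """Yield copies of x with position pos set to each alternative base."""
    for k in range(4):
        if x[pos, k] == 1:
            continue
        xm = x.copy()
        xm[pos] = 0
        xm[pos, k] = 1
        yield xm

def importance(model, x):
    """Per-position importance: mean drop in output over single mutations."""
    ref = model(x)
    return np.array([
        np.mean([ref - model(xm) for xm in mutants(x, j)])
        for j in range(x.shape[0])
    ])

def interaction_map(model, x):
    """fim[i, j]: change in importance of target j when source i is mutated,
    averaged over the alternative bases at i."""
    n = x.shape[0]
    base = importance(model, x)
    fim = np.zeros((n, n))
    for i in range(n):
        fim[i] = np.mean([base - importance(model, xm)
                          for xm in mutants(x, i)], axis=0)
        fim[i, i] = 0.0  # self-interaction is not meaningful here
    return fim

if __name__ == "__main__":
    x = one_hot("TTGATATTTTCTGT")
    fim = interaction_map(toy_model, x)
    print(np.round(fim[2, 10], 4))  # source in motif A, target in motif B
    print(np.round(fim[2, 0], 4))   # source in motif A, background target
```

In this toy setting, entries pairing a position in one motif with a position in the other are nonzero because of the multiplicative term, while entries involving background positions are zero. The brute-force loop costs on the order of L² model evaluations per sequence, which is why an efficient attribution method matters for real models.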
