fastISM: Performant in-silico saturation mutagenesis for convolutional neural networks

Deep learning models such as convolutional neural networks can accurately map biological sequences to associated functional readouts and properties by learning predictive de novo representations. In-silico saturation mutagenesis (ISM) is a popular feature attribution technique for inferring the contribution of every character in an input sequence to the model's predicted output. The main drawback of ISM is its runtime: it requires a forward propagation through the trained model for every possible mutation of each character in the input sequence in order to predict the effect on the output. We present fastISM, an algorithm that speeds up ISM by more than 10x for commonly used convolutional neural network architectures. fastISM is based on two observations: the majority of computation in ISM is spent in convolutional layers, and a single mutation disrupts only a limited region of the intermediate layers, which makes most of that computation redundant. fastISM reduces the runtime gap between ISM and backpropagation-based feature attribution methods. On multi-output architectures it runs far faster than backpropagation-based methods, making it feasible to run ISM on a large number of sequences. An easy-to-use Keras/TensorFlow 2 implementation of fastISM is available at https://github.com/kundajelab/fastISM, and a hands-on tutorial at https://colab.research.google.com/github/kundajelab/fastISM/blob/master/notebooks/colab/DeepSEA.ipynb.
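
The two observations above can be illustrated with a minimal pure-Python sketch. It is not the fastISM implementation and does not use the fastISM package API; the toy model, the kernel weights, and the helper names (`toy_model`, `naive_ism`, `changed_conv_positions`) are all assumptions made for illustration. The sketch runs naive ISM, which costs one forward pass per position and per character, and then checks that a single mutation changes only the convolution outputs whose window covers the mutated position.

```python
# Toy illustration only: every name and weight below is hypothetical and is
# not part of the fastISM package.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def conv1d_valid(x, kernel):
    """Valid 1D cross-correlation of a (length x 4) input with a (width x 4) kernel."""
    width = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(width) for c in range(4))
        for i in range(len(x) - width + 1)
    ]


def toy_model(seq, kernel):
    """Stand-in for a trained network: convolution, ReLU, then sum pooling."""
    return sum(max(0.0, v) for v in conv1d_valid(one_hot(seq), kernel))


def mutate(seq, pos, base):
    """Return a copy of seq with the character at pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def naive_ism(seq, kernel):
    """Naive ISM: scores[pos][base] = model(mutant) - model(reference).

    Runs len(seq) * len(ALPHABET) full forward passes.
    """
    ref = toy_model(seq, kernel)
    return [
        {base: toy_model(mutate(seq, pos, base), kernel) - ref for base in ALPHABET}
        for pos in range(len(seq))
    ]


def changed_conv_positions(seq, pos, base, kernel):
    """Indices of convolution outputs that differ between reference and mutant."""
    ref = conv1d_valid(one_hot(seq), kernel)
    mut = conv1d_valid(one_hot(mutate(seq, pos, base)), kernel)
    return [i for i, (a, b) in enumerate(zip(ref, mut)) if a != b]


# A width-3 kernel that responds to the motif "ACG".
KERNEL = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
SEQ = "TTACGTTTTT"

scores = naive_ism(SEQ, KERNEL)
affected = changed_conv_positions(SEQ, 5, "A", KERNEL)

# Only outputs whose window covers position 5 can change (indices 3 to 5 here).
print("affected convolution outputs:", affected)
print("effect of A->T at position 2:", scores[2]["T"])
```

In this toy setting, at most `width` convolution outputs can change per mutation, so recomputing the whole layer for every mutant repeats work that the reference pass has already done. That redundancy is what the abstract identifies as the source of the speed-up.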

[1]  Chandra L. Theesfeld,et al.  Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk , 2018, Nature Genetics.

[2]  Matt Ploenzke,et al.  Improving representations of genomic sequence motifs in convolutional networks with exponential activations , 2020, Nature Machine Intelligence.

[3]  De-Shuang Huang,et al.  Recurrent Neural Network for Predicting Transcription Factor Binding Sites , 2018, Scientific Reports.

[4]  Mohamed Chaabane,et al.  Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities , 2019, Bioinform..

[5]  Charles E. McAnany,et al.  Deep learning at base-resolution reveals motif syntax of the cis-regulatory code , 2019, bioRxiv.

[6]  Avanti Shrikumar,et al.  Learning Important Features Through Propagating Activation Differences , 2017, ICML.

[7]  David G. Knowles,et al.  Predicting Splicing from Primary Sequence with Deep Learning , 2019, Cell.

[8]  Fabian J Theis,et al.  Deep learning: new computational modelling techniques for genomics , 2019, Nature Reviews Genetics.

[9]  Xiaohui S. Xie,et al.  DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences , 2015, bioRxiv.

[10]  May D. Wang,et al.  DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins , 2016, bioRxiv.

[11]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[12]  Charles J. Vaske,et al.  Deep Learning Implicitly Handles Tissue Specific Phenomena to Predict Tumor DNA Accessibility and Immune Activity , 2019, iScience.

[13]  David R. Kelley,et al.  Sequential regulatory activity prediction across chromosomes with convolutional neural networks. , 2018, Genome research.

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Agata Wesolowska-Andersen,et al.  Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals , 2020, eLife.

[16]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[17]  Sean R. Eddy,et al.  Inferring Sequence-Structure Preferences of RNA-Binding Proteins with Convolutional Residual Networks , 2018, bioRxiv.

[18]  Ankur Taly,et al.  Axiomatic Attribution for Deep Networks , 2017, ICML.

[19]  David R. Kelley,et al.  Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks , 2015, bioRxiv.

[20]  David R. Kelley,et al.  Predicting 3D genome folding from DNA sequence with Akita , 2020, Nature Methods.

[21]  Jay Shendure,et al.  High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis , 2009, Nature Biotechnology.

[22]  Avanti Shrikumar,et al.  Technical Note on Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco) version 0.5.6.5 , 2018, 1811.00416.

[23]  Gianluca Pollastri,et al.  Deep learning methods in protein structure prediction , 2020, Computational and structural biotechnology journal.

[24]  B. Frey,et al.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning , 2015, Nature Biotechnology.

[25]  Scott Lundberg,et al.  A Unified Approach to Interpreting Model Predictions , 2017, NIPS.

[26]  Anupama Jha,et al.  Enhanced Integrated Gradients: improving interpretability of deep learning models using splicing codes as a case study , 2020, Genome Biology.