Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs Using Integrated Gradients

Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos available at http://bit.ly/gkmexplainvids.

[1]  Ankur Taly,et al.  Axiomatic Attribution for Deep Networks , 2017, ICML.

[2]  Benjamin J. Strober,et al.  A method to predict the impact of regulatory variants from DNA sequence , 2015, Nature Genetics.

[3]  Jie Wang,et al.  Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium , 2012, Nucleic Acids Res..

[4]  Scott Lundberg,et al.  A Unified Approach to Interpreting Model Predictions , 2017, NIPS.

[5]  ENCODEConsortium,et al.  An Integrated Encyclopedia of DNA Elements in the Human Genome , 2012, Nature.

[6]  Joseph K. Pickrell,et al.  DNaseI sensitivity QTLs are a major determinant of human expression variation , 2011, Nature.

[7]  C. Glass,et al.  Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. , 2010, Molecular cell.

[8]  Dongwon Lee,et al.  LS-GKM: a new gkm-SVM for large-scale datasets , 2016, Bioinform..

[9]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[10]  Avanti Shrikumar,et al.  Learning Important Features Through Propagating Activation Differences , 2017, ICML.

[11]  Burkhard Rost,et al.  Comprehensive in silico mutagenesis highlights functionally important residues in proteins , 2008, ECCB.

[12]  Manolis Kellis,et al.  Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments , 2013, Nucleic acids research.

[13]  Christina S. Leslie,et al.  Fast String Kernels using Inexact Matching for Protein Sequences , 2004, J. Mach. Learn. Res..

[14]  Mikael Bodén,et al.  MEME Suite: tools for motif discovery and searching , 2009, Nucleic Acids Res..

[15]  Morteza Mohammad Noori,et al.  Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features , 2014, PLoS Comput. Biol..