Interpretable Convolution Methods for Learning Genomic Sequence Motifs

The first-layer filters of convolutional neural networks tend to learn, or extract, spatial features from the data. When applied to genomic sequence data, these learned features are often visualized and interpreted by converting them to sequence logos, an information-based representation of the consensus nucleotide motif. Such motifs, however, are obtained through post-training procedures that often discard the filter weights themselves and instead rely on finding the sequences maximally correlated with a given filter. Moreover, the filters collectively learn motifs with high redundancy, often simply shifted representations of the same sequence. We propose a schema to learn sequence motifs directly through weight constraints and transformations such that the individual weights comprising the filter are directly interpretable as either position weight matrices (PWMs) or information gain matrices (IGMs). We additionally leverage regularization to encourage learning highly representative motifs with low inter-filter redundancy. By learning PWMs and IGMs directly, we present preliminary results showing that our method can incorporate previously annotated database motifs while also learning motifs de novo, and we then outline a pipeline for using these tools jointly in a data application.
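To make the idea of directly interpretable filter weights concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a column-wise softmax as a stand-in for the weight constraint that turns raw filter weights into a PWM, and it assumes an information-based transformation in which each entry is p * log2(p / b) for a background frequency b, so that each column sums to the Kullback-Leibler divergence from the background in bits. The function names `softmax_columns` and `information_gain`, the uniform background, and the small `eps` offset are all illustrative choices, not taken from the paper.

```python
import numpy as np


def softmax_columns(weights):
    """Map unconstrained filter weights (4 x L) to a PWM.

    Assumption: a column-wise softmax stands in for the paper's
    weight constraint; each column becomes a distribution over A, C, G, T.
    """
    shifted = weights - weights.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def information_gain(pwm, background=None, eps=1e-9):
    """Per-entry information terms p * log2(p / b).

    Assumption: this is one plausible information-based transformation;
    summing a column gives its KL divergence from the background in bits.
    """
    if background is None:
        # Uniform background over the alphabet (hypothetical default).
        background = np.full(pwm.shape[0], 1.0 / pwm.shape[0])
    bg = np.asarray(background)[:, None]
    # eps keeps log2 finite when a probability is exactly zero.
    return pwm * np.log2((pwm + eps) / bg)


rng = np.random.default_rng(0)
raw_filter = rng.normal(size=(4, 8))   # hypothetical filter: 4 nucleotides x 8 positions
pwm = softmax_columns(raw_filter)
igm = information_gain(pwm)
column_bits = igm.sum(axis=0)          # information content per position
```

Under a uniform background, `column_bits` lies between 0 bits (a position with no preference) and 2 bits (a position fixed to a single nucleotide), which matches the vertical scale of a standard sequence logo.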
