DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning

Nucleotide variants can cause functional changes by altering protein-RNA binding in various and subtle ways that are not easy to predict. This can affect processes such as splicing, nuclear shuttling, and transcript stability. Correct modelling of protein-RNA binding is therefore critical when predicting the effects of sequence variations. Many RNA-binding proteins recognize a diverse set of motifs, and binding typically also depends on the genomic context, which makes this task particularly challenging. Although existing protein binding site models draw on additional data sources, such as RNA structure and functional gene context, to capture context, they still need improvement and have not been developed to predict the effect of sequence variants. Here, we present DeepCLIP, the first method for context-aware modelling and prediction of protein binding to nucleic acids using exclusively sequence data as input. We show that DeepCLIP outperforms existing methods for modelling RNA-protein binding. Importantly, we demonstrate that DeepCLIP reliably predicts the functional effects of contextually dependent nucleotide variants in independent wet lab experiments. Furthermore, we show how DeepCLIP binding profiles can be used to design therapeutically relevant antisense oligonucleotides and to uncover possible position-dependent regulation in a tissue-specific manner. DeepCLIP can be freely used at http://deepclip.compbio.sdu.dk.

Highlights

We have designed DeepCLIP as a simple neural network that requires only CLIP binding sites as input. The architecture and parameter settings of DeepCLIP make it an efficient classifier that is robust to train, so high-performing models are easy to train and recreate.

Using an extensive benchmark dataset, we demonstrate that DeepCLIP outperforms existing tools in classification. Furthermore, DeepCLIP provides direct information about the neural network’s decision process through visualization of binding motifs and a binding profile that directly indicates the sequence elements contributing to the classification.

To show that DeepCLIP models generalize to different datasets, we demonstrate that predictions correlate with in vivo and in vitro experiments using quantitative binding assays and minigenes.

Identifying the binding sites of regulatory RNA-binding proteins is fundamental for efficient design of (therapeutic) antisense oligonucleotides. Employing a reported disease-associated mutation, we demonstrate that DeepCLIP can be used to design therapeutic antisense oligonucleotides that block regions important for binding of regulatory proteins and correct aberrant splicing.

Using DeepCLIP binding profiles, we uncovered a possible position-dependent mechanism behind the reported tissue specificity of a group of TDP-43-repressed pseudoexons.

We have made DeepCLIP available as an online tool for training and applying protein-RNA binding deep learning models and for predicting the potential effects of clinically detected sequence variations (http://deepclip.compbio.sdu.dk/). We also provide DeepCLIP as a configurable stand-alone program (http://www.github.com/deepclip).
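
The idea of predicting a variant's effect from sequence alone can be illustrated with a minimal sketch. It is not DeepCLIP's implementation or API: the one-hot encoding is an assumed, commonly used input representation for sequence-only models, and the names `one_hot`, `apply_variant`, `variant_effect`, and `toy_score` are hypothetical. The scorer is a deliberately trivial stand-in (the fraction of "UG" dinucleotides) so that the example runs without a trained model; any function mapping a sequence to a binding score could be substituted.

```python
from typing import Callable, List

ALPHABET = "ACGU"


def one_hot(seq: str) -> List[List[int]]:
    """Encode an RNA sequence as one row of four 0/1 values per nucleotide.

    DNA input is accepted by mapping T to U; unknown letters give an all-zero row.
    """
    seq = seq.upper().replace("T", "U")
    return [[int(base == letter) for letter in ALPHABET] for base in seq]


def apply_variant(seq: str, pos: int, alt: str) -> str:
    """Return seq with the nucleotide at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("variant position outside sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(score: Callable[[str], float], ref: str, pos: int, alt: str) -> float:
    """Score difference (variant minus reference) under any sequence-only scorer."""
    return score(apply_variant(ref, pos, alt)) - score(ref)


def toy_score(seq: str) -> float:
    """Placeholder scorer, NOT DeepCLIP: fraction of dinucleotides equal to 'UG'."""
    seq = seq.upper().replace("T", "U")
    if len(seq) < 2:
        return 0.0
    hits = sum(seq[i:i + 2] == "UG" for i in range(len(seq) - 1))
    return hits / (len(seq) - 1)


if __name__ == "__main__":
    reference = "UGUGUG"
    # A negative value means the variant lowers the (toy) binding score.
    print(variant_effect(toy_score, reference, pos=1, alt="A"))
```

With a trained model in place of `toy_score`, the same reference-versus-variant comparison gives a simple measure of how much a single substitution is predicted to change binding.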
