SpliceRover: interpretable convolutional neural networks for improved splice site prediction

Motivation: During the last decade, improvements in high‐throughput sequencing have generated a wealth of genomic data. Functionally interpreting these sequences and finding the biological signals that are hallmarks of gene function and regulation is currently done mostly by automated genome annotation platforms, which rely mainly on integrated machine learning frameworks to identify different functional sites of interest, including splice sites. Splicing is an essential step in the gene regulation process, and the correct identification of splice sites is a cornerstone of any genome annotation system.

Results: In this paper, we present SpliceRover, a predictive deep learning approach that outperforms the state‐of‐the‐art in splice site prediction. SpliceRover uses convolutional neural networks (CNNs), which have been shown to obtain cutting‐edge performance on a wide variety of prediction tasks. We adapted this approach to deal with genomic sequence inputs, and show that it consistently outperforms existing approaches, with relative improvements in prediction effectiveness of up to 80.9% when measured in terms of false discovery rate. However, a major criticism of CNNs concerns their ‘black box’ nature, as mechanisms to obtain insight into their reasoning processes are limited. To facilitate interpretability of the SpliceRover models, we introduce an approach to visualize the biologically relevant information learnt. We show that our visualization approach is able to recover features known to be important for splice site prediction (binding motifs around the splice site, presence of polypyrimidine tracts and branch points), as well as reveal new features (e.g. several types of exclusion patterns near splice sites).

Availability and implementation: SpliceRover is available as a web service. The prediction tool and instructions can be found at http://bioit2.irc.ugent.be/splicerover/.
Supplementary information: Supplementary data are available at Bioinformatics online.
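
As an illustration of two ideas the abstract mentions, the sketch below shows one common way to turn a genomic sequence into a numeric input for a CNN, and how a relative improvement in false discovery rate can be computed. It is a minimal sketch under stated assumptions: the abstract does not specify SpliceRover's input encoding, so the one-hot scheme is assumed here, and the function names and all counts are hypothetical, not taken from the paper.

```python
NUCLEOTIDES = "ACGT"  # assumed channel order; the abstract does not specify one


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) are left as all-zero rows.
    This encoding is an assumption, not the documented SpliceRover input.
    """
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = NUCLEOTIDES.find(base)
        if index >= 0:
            row[index] = 1
        encoded.append(row)
    return encoded


def false_discovery_rate(true_positives, false_positives):
    """FDR = FP / (FP + TP); defined as 0.0 when nothing is predicted positive."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        return 0.0
    return false_positives / predicted_positives


def relative_fdr_improvement(baseline_fdr, new_fdr):
    """Fraction by which the FDR drops relative to the baseline FDR."""
    if baseline_fdr <= 0:
        raise ValueError("baseline FDR must be positive")
    return (baseline_fdr - new_fdr) / baseline_fdr


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not results from the paper.
    baseline = false_discovery_rate(true_positives=80, false_positives=20)
    improved = false_discovery_rate(true_positives=90, false_positives=10)
    print(one_hot_encode("GTAN"))
    print(relative_fdr_improvement(baseline, improved))
```

Under this definition, a relative improvement of 80.9% would mean the new model's false discovery rate is 19.1% of the baseline's, which is a different statement from an 80.9 percentage point drop.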
