COSSMO: Predicting Competitive Alternative Splice Site Selection using Deep Learning

Motivation: Alternative splice site selection is inherently competitive: the probability that a given splice site is used depends strongly on the strength of neighboring sites. Here we present a new model, the Competitive Splice Site Model (COSSMO), which explicitly models these competitive effects and predicts the PSI distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3’ acceptor site conditional on a fixed upstream 5’ donor site, or the choice of a 5’ donor site conditional on a fixed 3’ acceptor site. We build four different architectures that use convolutional layers, communication layers, LSTMs, and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model.

Results: COSSMO predicts the most frequently used splice site with an accuracy of 70% on unseen test data and achieves an R² of 60% in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences as well as the motifs of many known splicing factors with high specificity.

Availability: Our dataset is available from http://cossmo.deepgenomics.com.

Contact: frey@deepgenomics.com

Supplementary information: Supplementary data are available at Bioinformatics online.
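The competitive formulation in the abstract can be illustrated with a small sketch. It assumes that each putative splice site receives a scalar score and that the scores are normalized with a softmax to give a PSI distribution; the abstract does not state the normalization, so this choice, the function name `psi_distribution`, and the example scores are all hypothetical and are not the authors' implementation.

```python
import math


def psi_distribution(scores):
    """Turn per-site scores into a PSI distribution with a softmax.

    Assumption: softmax normalization. It works for any number of
    candidate sites and makes each site's PSI depend on all the others.
    """
    if not scores:
        raise ValueError("need at least one candidate splice site")
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate acceptor sites that compete
# for the same fixed donor site.
weak_neighbor = psi_distribution([2.0, 0.0, -1.0])

# Same first site, but its neighbor is now much stronger.
strong_neighbor = psi_distribution([2.0, 3.0, -1.0])

print(weak_neighbor[0], strong_neighbor[0])
```

The score of the first site is identical in both calls, yet its PSI drops when the neighboring site becomes stronger. This is the competitive effect the abstract describes: usage of a site is not a function of that site's strength alone.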
