SpliceVec: distributed feature representations for splice junction prediction

Identifying intron boundaries, called splice junctions, is an important part of delineating gene structure and function. It also provides valuable insight into the role of alternative splicing in increasing the functional diversity of genes. RNA-seq identifies splice junctions by mapping short reads to the reference genome, a process that is prone to errors due to random sequence matches. This motivates identifying splice junctions through computational methods based on machine learning. Existing models depend on feature extraction and selection to capture the splicing signals that lie in the vicinity of splice junctions, but such manually extracted features are not exhaustive. We introduce a distributed feature representation, SpliceVec, to avoid the explicit and biased feature extraction generally adopted for such tasks. SpliceVec is based on two widely used distributed representation models from natural language processing. The learned feature representation, in the form of SpliceVec, is fed to a multilayer perceptron for the splice junction classification task. An intrinsic evaluation of SpliceVec indicates that it groups true and false sites distinctly. Our study of the optimal context for feature extraction indicates that including the entire intronic sequence works better than using only the flanking upstream and downstream regions around splice junctions. Further, the performance of SpliceVec is invariant to whether a splice junction is canonical or non-canonical. The proposed model remains consistent in its performance even with a reduced dataset and a class-imbalanced dataset. SpliceVec is computationally efficient and can also be trained on user-defined data.
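
Distributed representation models from natural language processing operate on sequences of tokens, so a nucleotide sequence must first be split into "words". The abstract does not say how SpliceVec does this. The sketch below assumes overlapping k-mers, a common choice for biological sequences. The function name `kmer_tokens`, the parameters `k` and `stride`, and the example sequence are hypothetical and serve only as illustration.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mers ("words").

    Assumption: k-mer tokenization with a configurable stride; the
    abstract does not specify the tokenization used by SpliceVec.
    """
    seq = sequence.upper()
    # A window of length k slides over the sequence in steps of `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical intron-like sequence (starts with GT, ends with AG).
example = "GTAAGTCAG"
overlapping = kmer_tokens(example, k=3)
non_overlapping = kmer_tokens(example, k=3, stride=3)
print(overlapping)
print(non_overlapping)
```

The resulting token list can be treated as a sentence and passed to any embedding model that accepts tokenized text. A stride of 1 keeps every reading frame, while a stride equal to `k` yields non-overlapping words.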
