SpliceJumper: a classification-based approach for calling splicing junctions from RNA-seq data

BackgroundNext-generation RNA sequencing technologies have been widely applied in transcriptome profiling. This facilitates further studies of gene structure and expression on the genome wide scale. It is an important step to align reads to the reference genome and call out splicing junctions for the following analysis, such as the analysis of alternative splicing and isoform construction. However, because of the existence of introns, when RNA-seq reads are aligned to the reference genome, reads can not be fully mapped at splicing sites. Thus, it is challenging to align reads and call out splicing junctions accurately.ResultsIn this paper, we present a classification based approach for calling splicing junctions from RNA-seq data, which is implemented in the program SpliceJumper. SpliceJumper uses a machine learning approach which combines multiple features extracted from RNA-seq data. We compare SpliceJumper with two existing RNA-seq analysis approaches, TopHat2 and MapSplice2, on both simulated and real data. Our results show that SpliceJumper outperforms TopHat2 and MapSplice2 in accuracy. The program SpliceJumper can be downloaded at https://github.com/Reedwarbler/SpliceJumper.

[1]  D. Black Mechanisms of alternative pre-messenger RNA splicing. , 2003, Annual review of biochemistry.

[2]  Guey-Shin Wang,et al.  Splicing in disease: disruption of the splicing code and the decoding machinery , 2007, Nature Reviews Genetics.

[3]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[4]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[5]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[6]  Giorgio Valle,et al.  PASS: a program to align short sequences , 2009, Bioinform..

[7]  Vipin T. Sreedharan,et al.  RNA‐Seq Read Alignments with PALMapper , 2010, Current protocols in bioinformatics.

[8]  Serban Nacu,et al.  Fast and SNP-tolerant detection of complex variants and splicing in short reads , 2010, Bioinform..

[9]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[10]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[11]  M. Wilkins,et al.  Whole Transcriptome Sequencing Reveals Gene Expression and Splicing Differences in Brain Regions Affected by Alzheimer's Disease , 2011, PloS one.

[12]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer , 2011, Nature Biotechnology.

[13]  Brian P. Brunk,et al.  Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM) , 2011, Bioinform..

[14]  Cole Trapnell,et al.  TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions , 2013, Genome Biology.

[15]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[16]  Roderic Guigó,et al.  The GEM mapper: fast, accurate and versatile alignment by filtration , 2012, Nature Methods.

[17]  Kai Ye,et al.  PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data , 2012, Bioinform..

[18]  Hugo Y. K. Lam,et al.  Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes , 2012, Cell.

[19]  M. Borodovsky,et al.  TrueSight: a new algorithm for splice junction detection using RNA-seq , 2012, Nucleic acids research.

[20]  J. Harrow,et al.  Systematic evaluation of spliced alignment programs for RNA-seq data , 2013, Nature Methods.

[21]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[22]  Yufeng Wu,et al.  GINDEL: Accurate Genotype Calling of Insertions and Deletions from Low Coverage Population Sequence Reads , 2014, PloS one.