MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery

The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second-generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long (≥75 bp) reads. MapSplice does not depend on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of the read alignments supporting a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extent of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice.
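
To illustrate what "diversity of read alignments" for a splice could mean in practice, the sketch below scores a candidate junction by the Shannon entropy of the offsets at which supporting reads cross it. This is an illustrative assumption only: the abstract does not state which diversity measure MapSplice uses, and the function name `alignment_diversity` and the sample offsets are hypothetical.

```python
import math
from collections import Counter


def alignment_diversity(offsets):
    """Shannon entropy (in bits) of the read offsets at which a junction is crossed.

    `offsets` holds, for each supporting read, the position within the read
    where the splice occurs. This measure is an assumed stand-in for
    "alignment diversity", not the published MapSplice formula.
    """
    if not offsets:
        return 0.0
    counts = Counter(offsets)
    total = len(offsets)
    # Sum of p * log2(1 / p) over the distinct offsets.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical junctions: one supported by identical (stacked) reads,
# one supported by reads that cross the junction at varied offsets.
stacked = [20, 20, 20, 20]
spread = [5, 12, 20, 31]

print(alignment_diversity(stacked))
print(alignment_diversity(spread))
```

Under this assumed measure, a junction supported only by reads that stack at one offset scores lower than one crossed at many different offsets, which is the intuition behind preferring diverse alignment support over raw read count.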
