RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes

Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap—our tool implementing quasi-mapping—is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we use to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically meaningful groups. Availability and implementation: RapMap is implemented in C ++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap. Contact: rob.patro@cs.stonybrook.edu Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Zhengwei Zhu,et al.  CD-HIT: accelerated for clustering the next-generation sequencing data , 2012, Bioinform..

[2]  Lior Pachter,et al.  Near-optimal RNA-Seq quantification , 2015, ArXiv.

[3]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[4]  A. Kristjuhan,et al.  Elongator interactions with nascent mRNA revealed by RNA immunoprecipitation. , 2004, Molecular cell.

[5]  Sven Rahmann,et al.  Building and Documenting Workflows with Python-Based Snakemake , 2012, GCB.

[6]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[7]  Wei Wang,et al.  RNA-Skim: a rapid method for RNA-Seq quantification at transcript level , 2014, Bioinform..

[8]  Alessandro Vullo,et al.  Ensembl 2015 , 2014, Nucleic Acids Res..

[9]  Kevin S. Smith,et al.  High-Resolution Transcriptome Analysis with Long-Read RNA Sequencing , 2014, PloS one.

[10]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[11]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[12]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[13]  Masao Nagasaki,et al.  TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNA-Seq reads , 2014, BMC Genomics.

[14]  I. Nookaew,et al.  A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae , 2012, Nucleic acids research.

[15]  A. Oshlack,et al.  Corset: enabling differential gene expression analysis for de novo assembled transcriptomes , 2014, Genome Biology.

[16]  Geet Duggal,et al.  Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment , 2015 .

[17]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[18]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[19]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[20]  Ion I. Mandoiu,et al.  Estimation of alternative splicing isoform frequencies from RNA-Seq data , 2010, Algorithms for Molecular Biology.

[21]  David G Hendrickson,et al.  Differential analysis of gene regulation at transcript resolution with RNA-seq , 2012, Nature Biotechnology.

[22]  W. Shi,et al.  The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote , 2013, Nucleic acids research.

[23]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[24]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[25]  S. Dongen A cluster algorithm for graphs , 2000 .

[26]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[27]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[28]  Masao Nagasaki,et al.  TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference , 2013, Bioinform..

[29]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[30]  R. Guigó,et al.  Modelling and simulating generic RNA-Seq experiments with the flux simulator , 2012, Nucleic acids research.

[31]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.

[32]  Sebastiano Vigna,et al.  Broadword Implementation of Rank/Select Queries , 2008, WEA.

[33]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[34]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[35]  Pedro G. Ferreira,et al.  Transcriptome and genome sequencing uncovers functional variation in humans , 2013, Nature.

[36]  Colin N. Dewey,et al.  RNA-Seq gene expression estimation with read mapping uncertainty , 2009, Bioinform..

[37]  Lucian Ilie,et al.  The longest common extension problem revisited and applications to approximate string searching , 2010, J. Discrete Algorithms.

[38]  Faraz Hach,et al.  mrsFAST: a cache-oblivious algorithm for short-read mapping , 2010, Nature Methods.

[39]  Jürg Bähler,et al.  Proportionality: A Valid Alternative to Correlation for Relative Data , 2014, bioRxiv.