Scavenger: A pipeline for recovery of unaligned reads utilising similarity with aligned reads

Read alignment is an important step in RNA-seq analysis because its results form the basis for downstream analyses. However, recent studies have shown that published alignment tools vary in mapping sensitivity and do not necessarily align all the reads that should have been aligned, a problem we term the false-negative non-alignment problem. Here we present Scavenger, a Python-based bioinformatics pipeline for recovering unaligned reads using a novel mechanism in which a putative alignment location is discovered based on sequence similarity between aligned and unaligned reads. We show that Scavenger can recover unaligned reads in a range of simulated and real RNA-seq datasets, including single-cell RNA-seq data. We find that recovered reads tend to contain more genetic variants relative to the reference genome than previously aligned reads, indicating that divergence between personal and reference genomes plays a role in the false-negative non-alignment problem. Even when the number of recovered reads is small relative to the total number of reads, adding them can affect downstream analyses, especially estimates of expression and differential expression for lowly expressed genes, such as pseudogenes.
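The core idea, borrowing a putative location for an unaligned read from the aligned read it most resembles, can be illustrated with a toy sketch. This is not Scavenger's implementation: shared k-mer counting stands in for a real sequence-similarity search, and the function names, the parameters `k` and `min_shared`, and the example reads are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(aligned_reads, k):
    """Map each k-mer to the ids of the aligned reads that contain it."""
    index = defaultdict(set)
    for read_id, (seq, _location) in aligned_reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def putative_locations(aligned_reads, unaligned_reads, k=8, min_shared=3):
    """Propose a location for each unaligned read from its most similar aligned read.

    aligned_reads:   {read_id: (sequence, location)}
    unaligned_reads: {read_id: sequence}
    Similarity is the number of shared k-mers (an assumption made for this sketch).
    """
    index = build_index(aligned_reads, k)
    proposals = {}
    for read_id, seq in unaligned_reads.items():
        votes = defaultdict(int)
        for kmer in kmers(seq, k):
            for aligned_id in index.get(kmer, ()):
                votes[aligned_id] += 1
        if not votes:
            continue  # no similar aligned read, so no proposal
        # Sort ids first so that ties are broken deterministically.
        best_id = max(sorted(votes), key=votes.get)
        if votes[best_id] >= min_shared:
            proposals[read_id] = aligned_reads[best_id][1]
    return proposals


# Hypothetical example data: u1 differs from a1 by a single base.
aligned = {
    "a1": ("ACGTACGTTAGCCTAGGATC", ("chr1", 1000)),
    "a2": ("TTGACCGGTATTCGGAACCT", ("chr2", 5000)),
}
unaligned = {
    "u1": "ACGTACGTTAGCCTAGGTTC",
    "u2": "GGGGGGGGGGGGGGGGGGGG",
}
proposals = putative_locations(aligned, unaligned)
print(proposals)
```

Here `u1` receives the location of `a1`, while `u2` shares no k-mers with any aligned read and gets no proposal. A proposed location is only a candidate; in practice it would still need to be confirmed by aligning the read against the reference at that position.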
