Comparative assessment of long-read error-correction software applied to RNA-sequencing data

Motivation: Long-read sequencing technologies offer promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, these technologies are currently hindered by high error rates in the output data, which affect analyses such as the identification of isoforms, exon boundaries, and open reading frames, and the creation of gene catalogues. Because such data are still new, computational methods are actively being developed, and options for the error correction of RNA-sequencing long reads remain limited.

Results: In this article, we evaluate the extent to which existing long-read DNA error-correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error-correction metrics but also the effect of correction on gene families, isoform diversity, bias towards the major isoform, and splice site detection. We find that long-read error-correction tools originally developed for DNA are also suitable for correcting RNA-sequencing data, especially in terms of increasing base-pair accuracy. However, investigators should be aware that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which error-correction tools should be used (or whether any should be used at all), depending on the application type.

Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser
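As an illustration of what a classical base-pair accuracy metric can look like, the sketch below computes a per-read error rate from a SAM-style CIGAR string and the NM edit distance of an aligned read. This is only one common definition, assumed here for illustration; the abstract does not say how the benchmark tool computes its metrics, and the function name `error_rate` is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM convention).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_rate(cigar: str, nm: int) -> float:
    """Return edit distance divided by the number of alignment columns.

    Assumption: alignment columns are the M, =, X, I and D operations.
    Soft/hard clips (S, H), padding (P) and introns (N) are not counted,
    so spliced alignments are not penalised for skipped introns.
    """
    columns = sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        raise ValueError("CIGAR string contains no aligned columns")
    return nm / columns


# A read aligned across an intron with 5 edits over 101 alignment columns.
raw = error_rate("50M2I1000N48M1D", nm=5)
# The same read after a hypothetical correction step left a single edit.
corrected = error_rate("50M1000N50M", nm=1)
print(f"raw: {raw:.4f}  corrected: {corrected:.4f}")
```

Comparing this value before and after correction for the same read gives a simple view of the gain in base-pair accuracy, but it says nothing about the effects on gene families or isoform diversity that the abstract warns about.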
