Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

MOTIVATION Nanopore long-read sequencing technology offers a promising alternative to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data, which affect analyses such as the identification of isoforms, exon boundaries and open reading frames, and the creation of gene catalogues. Because such data are new, computational methods are still actively being developed, and options for the error correction of Nanopore RNA-sequencing long reads remain limited.

RESULTS In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long-read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which error correction tools should be used, or whether any should be used at all, depending on the application type.

BENCHMARKING SOFTWARE https://gitlab.com/leoisl/LR_EC_analyser
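To make two of the reported effects concrete (isoform diversity and bias toward the major isoform), the following is a minimal sketch of how such per-gene statistics could be compared before and after correction. It is not the implementation used in LR_EC_analyser: the function `isoform_stats`, the input format (one `(gene_id, isoform_id)` pair per mapped read) and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def isoform_stats(assignments):
    """Summarise isoform usage per gene.

    `assignments` is a hypothetical input: an iterable of
    (gene_id, isoform_id) pairs, one pair per mapped read.
    Returns {gene_id: (number_of_isoforms_seen, major_isoform_fraction)}.
    """
    per_gene = defaultdict(Counter)
    for gene_id, isoform_id in assignments:
        per_gene[gene_id][isoform_id] += 1

    stats = {}
    for gene_id, counts in per_gene.items():
        total_reads = sum(counts.values())
        major_reads = max(counts.values())
        stats[gene_id] = (len(counts), major_reads / total_reads)
    return stats


# Toy read-to-isoform assignments (invented for illustration only).
raw = [("geneA", "A.1"), ("geneA", "A.1"), ("geneA", "A.2"),
       ("geneA", "A.3"), ("geneB", "B.1")]
corrected = [("geneA", "A.1"), ("geneA", "A.1"), ("geneA", "A.1"),
             ("geneA", "A.2"), ("geneB", "B.1")]

raw_stats = isoform_stats(raw)
corrected_stats = isoform_stats(corrected)

for gene in sorted(raw_stats):
    print(gene, "raw:", raw_stats[gene], "corrected:", corrected_stats.get(gene))
```

In this toy case, geneA drops from three detected isoforms to two after correction while the share of reads on its major isoform rises, which is the kind of perturbation the abstract warns about.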
