CONSENT: Scalable self-correction of long reads with multiple sequence alignment

Motivation: Third-generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore produce long reads of tens of kbp, which are expected to solve various problems such as contig and haplotype assembly, scaffolding, and structural variant calling. However, these reads also display high error rates of 10 to 30%, and thus require efficient error correction. Because the first long-read sequencing experiments produced reads with error rates above 15% on average, most methods relied on complementary short-read data to perform correction, in a hybrid approach. However, these sequencing technologies evolve fast, and the error rate of long reads is now capped at around 10-12%. As a result, self-correction is now frequently used as a first step of third-generation sequencing data analysis projects. Efficient tools for the self-correction of long reads are now available, and recent observations suggest that avoiding the use of second-generation sequencing reads could bypass their inherent bias.

Results: We introduce CONSENT, a new method for the self-correction of long reads that combines different strategies from the state of the art. Specifically, it combines a multiple sequence alignment strategy with the use of local de Bruijn graphs. Moreover, the multiple sequence alignment benefits from an efficient segmentation strategy based on k-mer chaining, which greatly reduces its running time. Our experiments show that CONSENT compares well to the latest state-of-the-art self-correction methods, and even outperforms them on real Oxford Nanopore datasets. In particular, they show that CONSENT is the only method able to scale to a human dataset containing Oxford Nanopore ultra-long reads, reaching lengths up to 340 kbp.

Availability and implementation: CONSENT is implemented in C++, supported on Linux platforms, and freely available at https://github.com/morispi/CONSENT.

Contact: pierre.morisse2@univ-rouen.fr
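The k-mer chaining idea mentioned in the Results can be illustrated with a small sketch. It is not CONSENT's actual implementation (which is written in C++), only a minimal Python illustration under stated assumptions: anchors are k-mers that occur exactly once in each of two sequences, and the chain is the longest colinear subset of those anchors, found with a longest-increasing-subsequence search. The function names `unique_kmers` and `chain_anchors` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def chain_anchors(a, b, k):
    """Return the longest colinear chain of shared unique k-mers.

    Each anchor is a (pos_in_a, pos_in_b) pair. Assumption for this sketch:
    anchors in the chain may overlap; a real tool would likely filter them.
    """
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    # Anchors sorted by their position in a.
    anchors = sorted((pa, ub[kmer]) for kmer, pa in ua.items() if kmer in ub)

    # Longest strictly increasing subsequence on positions in b.
    tails = []      # tails[j]: smallest final pos_in_b of a chain of length j + 1
    tails_idx = []  # index in anchors of the anchor stored in tails[j]
    prev = [-1] * len(anchors)
    for i, (_, pb) in enumerate(anchors):
        j = bisect_left(tails, pb)
        if j == len(tails):
            tails.append(pb)
            tails_idx.append(i)
        else:
            tails[j] = pb
            tails_idx[j] = i
        prev[i] = tails_idx[j - 1] if j > 0 else -1

    # Backtrack from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    # The shared 3-mer GGG is out of order between the two sequences,
    # so it is left out of the chain.
    print(chain_anchors("AAACCCGGG", "GGGAAACCC", 3))
```

Consecutive anchors in the returned chain delimit short stretches of the two sequences. Aligning those stretches independently, rather than the full sequences at once, is the kind of segmentation that can reduce alignment time.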
