Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly

Motivation: Long Read Sequencing (LRS) technologies are becoming essential complements to Short Read Sequencing (SRS) technologies for routine whole genome sequencing. LRS platforms produce DNA fragment reads of 10^3 to 10^6 bases, resolving many of the uncertainties that SRS reads leave in genome reconstruction and analysis. In particular, LRS characterizes long and complex structural variants that SRS cannot detect because of its short read length. Furthermore, assemblies produced from LRS reads are considerably more contiguous than those produced from SRS reads and span previously inaccessible telomeric and centromeric regions. However, a major obstacle to the adoption of LRS reads is their error rate of up to 15%, which is much higher than that of SRS reads and complicates downstream analysis pipelines.

Results: We present Ratatosk, a new error correction method for erroneous long reads based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph, while vertices are annotated with candidate Single Nucleotide Polymorphisms (SNPs). Long reads are subsequently anchored to the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences. We demonstrate that Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and increase indel call accuracy by up to about 40% compared to the raw data. An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and fewer misassemblies than an assembly created from PacBio HiFi reads.

Availability: https://github.com/DecodeGenetics/Ratatosk

Contact: guillaume.holley@decode.is
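The anchoring idea summarized in the Results can be illustrated with a toy sketch. This is not Ratatosk's implementation: it assumes a plain k-mer set as an uncompacted, uncolored de Bruijn graph, uses exact k-mer anchors only, ignores SNP annotation, and picks the replacement path whose length is closest to the original span (an arbitrary choice made for this example). All function names and the `slack` parameter are hypothetical.

```python
def kmers_of(seq, k):
    """Yield (position, k-mer) for every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_kmer_set(short_reads, k):
    """Node set of a toy de Bruijn graph (assumption: no compaction, no colors)."""
    return {kmer for read in short_reads for _, kmer in kmers_of(read, k)}


def find_anchors(long_read, graph, k):
    """Positions of long-read k-mers that occur exactly in the graph."""
    return [i for i, kmer in kmers_of(long_read, k) if kmer in graph]


def paths_between(graph, start, end, max_len):
    """Sequences spelled by graph paths from start to end, at most max_len bases."""
    k = len(start)  # assumes k >= 2
    found, stack = [], [start]
    while stack:
        seq = stack.pop()
        if len(seq) > k and seq.endswith(end):
            found.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for base in "ACGT":
            # Follow an edge if the next k-mer exists in the graph.
            if seq[-(k - 1):] + base in graph:
                stack.append(seq + base)
    return found


def correct_long_read(long_read, short_reads, k, slack=2):
    """Replace unanchored stretches between anchors with a graph path."""
    graph = build_kmer_set(short_reads, k)
    anchors = find_anchors(long_read, graph, k)
    if not anchors:
        return long_read  # nothing to anchor on; leave the read untouched
    prev = anchors[0]
    out = long_read[:prev + k]  # invariant: out ends with the k-mer at prev
    for pos in anchors[1:]:
        if pos == prev + 1:
            out += long_read[pos + k - 1]  # adjacent anchors: extend by one base
        else:
            span = pos - prev + k  # original length from anchor to anchor
            candidates = paths_between(
                graph, long_read[prev:prev + k], long_read[pos:pos + k], span + slack
            )
            if candidates:
                best = min(candidates, key=lambda p: abs(len(p) - span))
                out = out[:-k] + best
            else:
                out += long_read[prev + k:pos + k]  # no path found: keep raw bases
        prev = pos
    return out + long_read[prev + k:]


short_reads = ["ACGTTGCATG", "GCATGGCTAAC"]
noisy = "ACGTTGCACGGCTAAC"  # one substitution relative to ACGTTGCATGGCTAAC
print(correct_long_read(noisy, short_reads, k=4))
```

The path enumeration is exponential in the worst case and is bounded here only by `max_len`, so this sketch is suitable for small examples rather than real sequencing data.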
