Paper information - Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly

Motivation: Long Read Sequencing (LRS) technologies are becoming essential complements to Short Read Sequencing (SRS) technologies for routine whole genome sequencing. LRS platforms produce reads of DNA fragments from 10^3 to 10^6 bases long, resolving many of the uncertainties that SRS reads leave in genome reconstruction and analysis. In particular, LRS characterizes long and complex structural variants that SRS cannot detect because of its short read length. Furthermore, assemblies produced from LRS reads are considerably more contiguous than those produced from SRS reads, and they span previously inaccessible telomeric and centromeric regions. However, a major obstacle to the adoption of LRS reads is their error rate of up to 15%, much higher than that of SRS reads, which complicates downstream analysis pipelines.

Results: We present Ratatosk, a new error correction method for erroneous long reads based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph, while vertices are annotated with candidate Single Nucleotide Polymorphisms (SNPs). Long reads are subsequently anchored to the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences. We demonstrate that Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and increase indel call accuracy by up to about 40% compared to the raw data. An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and fewer misassemblies than an assembly created from PacBio HiFi reads.

Availability: https://github.com/DecodeGenetics/Ratatosk

Contact: guillaume.holley@decode.is
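The anchoring idea described in the Results can be illustrated with a toy sketch. This is not Ratatosk's implementation: the graph below is neither compacted nor colored, carries no SNP annotations, and uses exact k-mer matches only. The function names (`build_dbg`, `exact_anchors`, `spell_path`), the reads, and the value of k are all hypothetical and chosen for illustration.

```python
from collections import defaultdict, deque


def kmers(seq, k):
    """Yield (position, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_dbg(short_reads, k):
    """Toy de Bruijn graph: maps each k-mer to the set of k-mers that follow it."""
    graph = defaultdict(set)
    for read in short_reads:
        prev = None
        for _, km in kmers(read, k):
            graph[km]  # make sure the node exists even without successors
            if prev is not None:
                graph[prev].add(km)
            prev = km
    return graph


def exact_anchors(long_read, graph, k):
    """Positions in the long read whose k-mer is a node of the graph."""
    return [i for i, km in kmers(long_read, k) if km in graph]


def spell_path(graph, start, end, max_steps):
    """Breadth-first search from start to end; returns the spelled sequence or None."""
    queue = deque([(start, start)])
    while queue:
        node, seq = queue.popleft()
        if node == end:
            return seq
        if len(seq) - len(start) >= max_steps:
            continue  # bound the search so cycles cannot loop forever
        for nxt in sorted(graph.get(node, ())):
            queue.append((nxt, seq + nxt[-1]))
    return None


# Hypothetical data: two accurate short reads and one long read with a
# single substitution (A -> T at position 4).
K = 4
graph = build_dbg(["ACGTACGGA", "GTACGGATC"], K)
long_read = "ACGTTCGGATC"
anchors = exact_anchors(long_read, graph, K)
# Replace the unanchored stretch between the first two anchors with a graph path.
patch = spell_path(graph, long_read[anchors[0]:anchors[0] + K],
                   long_read[anchors[1]:anchors[1] + K], max_steps=8)
```

In this sketch the k-mers covering the erroneous base are absent from the graph, so the read is anchored on either side of the error and the gap is filled with the sequence spelled by a short-read path. A real corrector also has to choose among several candidate paths, which is where read coloring and inexact matches would come in.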
