ntEdit: scalable genome sequence polishing

Genome sequence assembly is now routine practice in genomics. However, depending on the methodology, the resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled E. coli and C. elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20X), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14s and <3m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20X coverage human genome sequence dataset, inspecting accuracy and resource usage when editing chromosomes 1 and 21 and the whole genome. ntEdit scaled linearly, executing in 30-40m on those sequences. We show how ntEdit ran in <2h20m to improve upon long-read and linked-read human genome assemblies of NA12878, using high-coverage (54X) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte) and used it to edit our pseudo-haploid assemblies of the 20 Gbp interior and white spruce genomes in <4h and <5h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024.

Availability: https://github.com/bcgsc/ntedit

Supplemental material is available online.
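The abstract describes ntEdit only as "Bloom filter-based". As an illustration of that general idea, and not of ntEdit's actual implementation, the following minimal Python sketch loads read k-mers into a Bloom filter and flags draft positions whose k-mers are absent from it. All names (`BloomFilter`, `build_filter`, `unsupported_positions`), the salted BLAKE2b hashing, the filter size, and the toy sequences are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several salted hashes (illustrative)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One salted BLAKE2b digest per hash function; an assumption of this
        # sketch, chosen for determinism rather than speed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq: str, k: int):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i : i + k]


def build_filter(reads, k: int, num_bits: int = 1 << 16, num_hashes: int = 3):
    """Insert all k-mers of all reads into a new Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def unsupported_positions(draft: str, bf: BloomFilter, k: int):
    """Return start positions of draft k-mers that the filter does not contain."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bf]


# Toy data (hypothetical): two overlapping "reads" and a draft with one substitution.
truth = "ACGTTGCATGCCGATTAGCTAGGCTTAACGGATCCGTA"
reads = [truth[:24], truth[12:]]
draft = truth[:15] + "G" + truth[16:]  # T -> G at position 15

read_filter = build_filter(reads, k=8)
flagged = unsupported_positions(draft, read_filter, k=8)
print(flagged)  # start positions of k-mers lacking read support
```

Because a Bloom filter never produces false negatives, every k-mer that truly occurs in the reads passes the check, so flagged positions can only come from k-mers overlapping the erroneous base. A polishing tool would go further than this sketch, for example by trying alternative bases or indels at a flagged position and keeping the change that restores k-mer support.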
