Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 days. Using three flow cells per sample, we achieved roughly 63× coverage, a read N50 of 42 kb and 6.5× coverage in reads >100 kb. Shasta produced a complete haploid human genome assembly in under 6 hours on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed. Highly contiguous human genomes can be assembled de novo in 6 hours using nanopore long-read sequences and the Shasta toolkit.
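The abstract quotes two summary statistics, a Phred quality score and a read N50. The sketch below illustrates how such figures are conventionally computed; it is not code from the Shasta toolkit. The function names `phred_qv` and `n50` and the toy read lengths are assumptions made for illustration only.

```python
import math


def phred_qv(identity: float) -> float:
    """Phred-scaled quality for a fractional identity in [0, 1).

    Uses the standard definition QV = -10 * log10(error rate),
    where error rate = 1 - identity.
    """
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# 99.9% identity corresponds to roughly QV 30, as stated in the abstract.
print(round(phred_qv(0.999), 2))

# Toy read lengths (hypothetical, not data from the paper).
print(n50([100, 80, 50, 30, 20]))
```

With the toy lengths, the total is 280 bases, so the N50 is the length at which the running sum of the longest reads first reaches 140.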
