Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

Present workflows for producing human genome assemblies from long-read technologies have cost and production-time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average of 63x coverage, a 42 Kb read N50, 90% median read identity, and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta, a de novo long-read assembler, and MarginPolish & HELEN, a suite of nanopore assembly polishing algorithms. On a single commercial compute node, Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid, and trio-binned human samples in terms of accuracy, cost, and time, and demonstrate improvements relative to current state-of-the-art methods in all three areas. We further show that the addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
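Two of the summary statistics quoted above, read N50 and the identity-to-QV correspondence, follow standard definitions that a short sketch can make concrete. The Python below is illustrative only and is not taken from Shasta, MarginPolish, or HELEN; the function names `read_n50` and `identity_to_qv` are hypothetical. It assumes the usual conventions: N50 is the largest length L such that reads of length at least L contain at least half of all sequenced bases, and QV is the Phred-scaled error rate, -10 * log10(1 - identity).

```python
import math


def read_n50(lengths):
    """Largest length L such that reads >= L hold at least half of all bases.

    Hypothetical helper for illustration; not part of any tool named above.
    """
    total = sum(lengths)
    running = 0
    # Walk reads from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def identity_to_qv(identity):
    """Convert fractional identity (e.g. 0.999) to a Phred-scaled QV."""
    error = 1.0 - identity
    if error <= 0.0:
        return math.inf  # no observed errors
    return -10.0 * math.log10(error)


if __name__ == "__main__":
    # Toy read lengths, not data from the study.
    print(read_n50([10, 20, 30, 40]))
    # 99.9% identity corresponds to roughly QV30, as stated in the abstract.
    print(round(identity_to_qv(0.999), 2))
```

Under these definitions, the 90% median read identity reported for the raw reads corresponds to roughly QV10, while the polished haploid assemblies at 99.9% identity correspond to QV30.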
