Nanopore sequencing and assembly of a human genome with ultra-long reads

We report the sequencing and assembly of a reference genome for the human GM12878 Utah/CEPH cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. We produced 91.2 Gb of sequence data, representing ∼30× theoretical coverage. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
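The contiguity and coverage figures quoted above follow standard definitions, which the minimal Python sketch below illustrates. It is not the authors' pipeline: the function names `nx50` and `theoretical_coverage` are hypothetical, the contig lengths are toy values, and the 3.1 Gb genome size is an assumed round figure for a human genome rather than a number taken from the text.

```python
def nx50(lengths, total=None):
    """Length L such that sequences of length >= L sum to at least half of `total`.

    With total=None, `total` is the sum of the lengths, which gives N50.
    With total set to an (assumed) genome size, the result is NG50.
    Returns 0 if the sequences cover less than half of `total`.
    """
    if total is None:
        total = sum(lengths)
    half = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def theoretical_coverage(total_bases, genome_size):
    """Average depth if the reads were spread evenly over the genome."""
    return total_bases / genome_size


# Toy contig lengths (arbitrary units), purely for illustration.
toy_contigs = [8, 6, 4, 2]
toy_n50 = nx50(toy_contigs)             # half of 20 is reached at the 6-unit contig
toy_ng50 = nx50(toy_contigs, total=30)  # half of 30 is reached at the 4-unit contig

# 91.2 Gb of reads over an ASSUMED 3.1 Gb genome gives roughly 29.4x.
coverage = theoretical_coverage(91.2e9, 3.1e9)
```

Because NG50 is measured against a fixed genome size and not the assembly's own length, it allows contiguity to be compared across assemblies of different total size.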
