Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly

Obtaining accurate sequences from long DNA molecules is important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs with N50 up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
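The claim of "practically non-redundant" co-barcoding can be checked with a back-of-the-envelope occupancy estimate. The sketch below is illustrative only and is not taken from the paper: it assumes each of the 50 million beads carries a barcode drawn uniformly and independently from a pool of 3.6 billion (the two figures quoted above), which real bead populations may not satisfy. The function names are hypothetical.

```python
import math

# Figures quoted in the abstract.
POOL_SIZE = 3_600_000_000   # "up to 3.6 billion unique barcode sequences"
BEADS_PER_SAMPLE = 50_000_000  # "50 million barcodes per sample"


def fraction_sharing(n_beads: int, pool_size: int) -> float:
    """Probability that a given bead shares its barcode with at least one
    other bead, ASSUMING uniform, independent barcode draws.

    Computes 1 - (1 - 1/pool_size)**(n_beads - 1) in a numerically
    stable way using log1p/expm1.
    """
    return -math.expm1((n_beads - 1) * math.log1p(-1.0 / pool_size))


def expected_distinct(n_beads: int, pool_size: int) -> float:
    """Expected number of distinct barcodes among n_beads draws,
    under the same uniform-draw assumption."""
    return pool_size * -math.expm1(n_beads * math.log1p(-1.0 / pool_size))


if __name__ == "__main__":
    share = fraction_sharing(BEADS_PER_SAMPLE, POOL_SIZE)
    distinct = expected_distinct(BEADS_PER_SAMPLE, POOL_SIZE)
    print(f"Fraction of beads sharing a barcode: {share:.4f}")
    print(f"Expected distinct barcodes: {distinct:,.0f}")
```

Under these assumptions, only about 1.4% of beads would share a barcode with another bead. Two long fragments would be confused only if they landed on beads with the same barcode and also overlapped in the genome, so the effective redundancy would be lower still.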
