Assessment of human diploid genome assembly with 10x Linked-Reads data

Abstract

Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries.

Results: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823×, and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length (W$\mu_{FL}$) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state.

Conclusions: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
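As a rough illustration of the quantities named in the abstract, the sketch below computes a length-weighted mean fragment length and a read coverage per fragment. It rests on two assumptions that the abstract itself does not state: that "length-weighted" means the usual sum(L²)/sum(L) average, and that total coverage factors as C = CF × CR. The function names and the example numbers are hypothetical and are not taken from the study or from LRTK-SIM.

```python
def length_weighted_mean(lengths):
    """Length-weighted mean (assumed definition): sum(L^2) / sum(L).

    Each fragment contributes in proportion to its own length, so long
    fragments dominate the average more than in a plain mean.
    """
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one fragment of non-zero length")
    return sum(length * length for length in lengths) / total


def read_coverage_per_fragment(total_coverage, physical_coverage):
    """CR under the assumed relation C = CF * CR."""
    if physical_coverage <= 0:
        raise ValueError("physical coverage must be positive")
    return total_coverage / physical_coverage


# Hypothetical fragment lengths in kb, for illustration only.
fragments_kb = [20, 50, 100, 150]
print(length_weighted_mean(fragments_kb))    # 110.625 (plain mean would be 80)

# Hypothetical library: 60x total coverage at 400x physical coverage.
print(read_coverage_per_fragment(60, 400))
```

Under these assumptions, fixing C and raising CF necessarily lowers CR, which is one way to read the abstract's observation that very high physical coverage at a given C worsens assembly quality.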
