Accurate haplotype-resolved assembly reveals the origin of structural variants for human trios

Abstract

Motivation: Achieving a near-complete understanding of how an individual's genome affects that individual's phenotypes requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging.

Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to cobarcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling cobarcoded reads (Supernova), and are comparable to those of a trio-binning-based third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy [up to 99.99997% (Q65)]. This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies.

Availability and implementation: The code for the analysis is available at https://github.com/BGI-Qingdao/HAST

Supplementary information: Supplementary data are available at Bioinformatics online.
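To make the read-classification step described in the abstract concrete, the following is a minimal sketch of trio binning in general, not HAST's actual implementation. It assumes that classification relies on k-mers found in only one parent, a common way to use parental information that the abstract itself does not spell out. The function names (kmers, parent_specific_kmers, classify_read), the value of k, and the toy sequences are all hypothetical and chosen only for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets.

    Assumption: a k-mer seen in exactly one parent marks that parent's haplotype.
    """
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    return mat - pat, pat - mat


def classify_read(read, mat_only, pat_only, k):
    """Assign a child read to the parent whose specific k-mers it shares most."""
    read_kmers = kmers(read, k)
    m_hits = len(read_kmers & mat_only)
    p_hits = len(read_kmers & pat_only)
    if m_hits > p_hits:
        return "maternal"
    if p_hits > m_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parental sequences that differ only at their 3' end (hypothetical data).
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=4)

for read in ["TACGTAA", "ACGTCC", "ACGTACG"]:
    print(read, classify_read(read, mat_only, pat_only, k=4))
```

Each bin of reads would then be assembled on its own to produce one haplotype per parent. With cobarcoded data, votes could plausibly be pooled across all reads sharing a barcode rather than taken per read, but that is an assumption beyond what the abstract states.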
