Haplotype-resolved de novo assembly with phased assembly graphs

Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variation in a genome. However, existing algorithms either collapse heterozygous alleles into a single consensus copy or fail to separate the haplotypes cleanly enough to produce high-quality phased assemblies. Here we describe hifiasm, a new de novo assembler that uses long high-fidelity sequence reads to faithfully represent haplotype information in a phased assembly graph. Unlike other graph-based assemblers, which aim to maintain the contiguity of only one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables a graph trio binning algorithm that substantially improves on standard trio binning. On three human and five non-human datasets, including California redwood with a $\sim$30-gigabase hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms them on haplotype-resolved assembly.
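
For orientation, the standard trio binning that the abstract uses as its baseline can be sketched as follows: k-mers found in only one parent are collected, and each read from the child is assigned to the parent whose specific k-mers it shares more of. This is a minimal illustration of that general idea, not hifiasm's graph trio binning and not any tool's actual implementation. The function names, the tiny k value, and the toy sequences are all assumptions made for the example, and reverse complements and sequencing errors are ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_seqs, maternal_seqs, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    Hypothetical helper: real pipelines count k-mers from parental
    short reads and filter by abundance; here we take plain sequences.
    """
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign a child read to a parental haplotype by k-mer votes."""
    read_kmers = kmers(read, k)
    p = len(read_kmers & pat_only)
    m = len(read_kmers & mat_only)
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "ambiguous"  # no parent-specific k-mers, or a tie


# Toy data (assumed for illustration only).
K = 4
pat_only, mat_only = parent_specific_kmers(["ACGTACGTTT"], ["ACGTACGAAA"], K)
labels = [bin_read(r, pat_only, mat_only, K)
          for r in ("TACGTTT", "TACGAAA", "ACGTAC")]
print(labels)
```

Reads that fall entirely within sequence shared by both parents carry no parent-specific k-mers and stay ambiguous in this read-level scheme.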
