De novo diploid genome assembly using long noisy reads via haplotype-aware error correction and inconsistent overlap identification

High sequencing error rates have impeded the use of long noisy reads for diploid genome assembly. Most existing assemblers fail to distinguish heterozygous sites from sequencing errors in long noisy reads, and therefore generate collapsed assemblies with many haplotype switches. Here, we present PECAT, a phased error correction and assembly tool for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that retains heterozygous alleles while correcting sequencing errors. We develop a read-level SNP caller that further reduces SNP errors in the corrected reads. We then use a read grouping method to assign reads to different haplotype groups. To accelerate assembly, PECAT performs local alignment only when necessary. PECAT efficiently assembles diploid genomes using only long noisy reads and generates more contiguous haplotype-specific contigs than other assemblers. In particular, PECAT achieves a nearly haplotype-resolved assembly of B. taurus (Bison×Simmental) using Nanopore reads.
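
To make the read grouping step concrete, the following toy sketch splits reads into two haplotype groups based on the alleles they carry at SNP sites. It is an illustration only, not PECAT's actual algorithm: the function names (`consensus`, `score`, `group_reads`), the representation of a read as a dictionary from SNP position to observed allele, and the simple two-way iterative reassignment are all assumptions made for this example.

```python
from collections import Counter


def consensus(reads):
    """Majority allele per SNP position across a group of reads."""
    votes = {}
    for read in reads:
        for pos, allele in read.items():
            votes.setdefault(pos, Counter())[allele] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in votes.items()}


def score(read, cons):
    """Matches minus mismatches between a read and a group consensus.

    Positions missing from the consensus contribute nothing.
    """
    total = 0
    for pos, allele in read.items():
        if pos in cons:
            total += 1 if cons[pos] == allele else -1
    return total


def group_reads(reads, rounds=10):
    """Assign each read to haplotype group 0 or 1 (hypothetical scheme).

    Seeds the groups with the first read and the read that disagrees
    with it most, then repeatedly reassigns reads to the group whose
    consensus they match best until the labels stop changing.
    """
    seed0 = reads[0]
    seed1 = min(reads, key=lambda r: score(r, seed0))
    labels = [0 if score(r, seed0) >= score(r, seed1) else 1 for r in reads]
    for _ in range(rounds):
        cons = [
            consensus([r for r, lab in zip(reads, labels) if lab == g])
            for g in (0, 1)
        ]
        new = [0 if score(r, cons[0]) >= score(r, cons[1]) else 1 for r in reads]
        if new == labels:
            break
        labels = new
    return labels
```

Because each group's consensus is a majority vote, a read carrying an isolated SNP error can still be placed with the haplotype it matches at most sites, which is the intuition behind calling SNPs at the read level before grouping.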
