Hapo-G, Haplotype-Aware Polishing Of Genome Assemblies

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This directly affects the base quality of genome assemblies, particularly in coding regions, where sequencing errors can disrupt the coding frame of genes. In diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes, which can introduce premature stop codons. Several methods have been developed to polish genome assemblies using short reads; they generally inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms cannot properly process diploid genomes and typically switch from one haplotype to the other. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from short reads to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
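The haplotype-switching problem described above can be illustrated with a toy example. The sketch below is not the Hapo-G algorithm; it is a minimal illustration under assumed data, with hypothetical function names (`per_site_consensus`, `phased_consensus`) and an invented set of reads covering two heterozygous sites. It contrasts an independent per-nucleotide majority vote with a vote over the allele combinations actually observed together on reads.

```python
from collections import Counter

def per_site_consensus(reads):
    """Majority vote at each site independently (no phasing information)."""
    n_sites = len(reads[0])
    consensus = []
    for i in range(n_sites):
        # Count only reads that cover site i (None means "not covered").
        counts = Counter(r[i] for r in reads if r[i] is not None)
        consensus.append(counts.most_common(1)[0][0])
    return tuple(consensus)

def phased_consensus(reads):
    """Majority vote over allele combinations seen on reads spanning all sites."""
    spanning = [r for r in reads if None not in r]
    return Counter(spanning).most_common(1)[0][0]

# Hypothetical toy data: each read is (base at site 1, base at site 2).
# Haplotype 1 carries A...G, haplotype 2 carries C...T.
reads = (
    [("A", "G")] * 4      # reads from haplotype 1 spanning both sites
    + [("C", "T")] * 3    # reads from haplotype 2 spanning both sites
    + [(None, "T")] * 2   # haplotype 2 reads covering only site 2
)

mixed = per_site_consensus(reads)
phased = phased_consensus(reads)

# Report each consensus and whether any single read supports it.
print("per-site:", mixed, "supported by a read:", mixed in reads)
print("phased:  ", phased, "supported by a read:", phased in reads)
```

In this toy data set the per-site vote selects the allele of one haplotype at the first site and the allele of the other haplotype at the second site, a combination that no read carries, whereas the vote over spanning reads returns a combination that is observed on a single molecule.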
