Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies

Background: Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs.

Results: A new pipeline, Purge Haplotigs, was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs.

Conclusions: Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing-based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.
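The abstract does not spell out how read alignments help identify allelic contigs. One plausible reading is that contigs covered mostly at roughly half the expected diploid read depth are candidates for being unmerged haplotype copies. The following is a minimal sketch under that assumption, not the pipeline's actual code: the function names, the depth cutoffs (`low`, `mid`, `high`), and the `min_fraction` threshold are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List


def haploid_fraction(depths: List[int], low: int, mid: int, high: int) -> float:
    """Fraction of in-range bases whose depth falls in the haploid band [low, mid).

    Bases with depth below `low` or above `high` are ignored as likely
    noise or collapsed repeats (an assumption of this sketch).
    """
    haploid = sum(1 for d in depths if low <= d < mid)
    diploid = sum(1 for d in depths if mid <= d <= high)
    total = haploid + diploid
    return haploid / total if total else 0.0


def flag_suspect_contigs(
    depth_by_contig: Dict[str, List[int]],
    low: int,
    mid: int,
    high: int,
    min_fraction: float = 0.8,  # hypothetical threshold
) -> List[str]:
    """Return names of contigs that are mostly covered at haploid depth."""
    return sorted(
        name
        for name, depths in depth_by_contig.items()
        if haploid_fraction(depths, low, mid, high) >= min_fraction
    )


# Toy per-base depths: contig_a sits mostly at diploid depth (~60x),
# contig_b mostly at haploid depth (~30x).
example = {
    "contig_a": [60] * 90 + [30] * 10,
    "contig_b": [30] * 95 + [60] * 5,
}
suspects = flag_suspect_contigs(example, low=10, mid=45, high=90)
print(suspects)
```

In this toy input only `contig_b` is flagged. A depth-based flag like this would only nominate candidates; confirming that a flagged contig is allelic to another contig would still require sequence alignment, which this sketch does not attempt.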
