Genotyping structural variants in pangenome graphs using the vg toolkit

Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmark vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in VCF format.
