A robust benchmark for germline structural variant detection

New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12,745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9,641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing that it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
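The discovery criterion and the Mendelian check described above can be illustrated with a minimal Python sketch. It is not the consortium's pipeline: the `CandidateSV` record, the function names, and the encoding of genotypes as alternate-allele counts (0, 1, 2) are assumptions made for illustration. Only the thresholds (≥50 bp, at least 2 technologies or 5 callsets) come from the text, and the isolation and long-read genotyping steps are not modeled.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
MIN_SV_LENGTH_BP = 50
MIN_TECHNOLOGIES = 2
MIN_CALLSETS = 5


@dataclass(frozen=True)
class CandidateSV:
    """Hypothetical record for one merged candidate call."""
    sv_type: str             # "INS" or "DEL"
    length_bp: int           # size of the inserted or deleted sequence
    technologies: frozenset  # technologies with a callset reporting the SV
    callsets: frozenset      # individual callsets reporting the SV


def passes_support_filter(sv: CandidateSV) -> bool:
    """Keep insertions/deletions >=50 bp found by >=2 technologies or >=5 callsets."""
    if sv.sv_type not in {"INS", "DEL"}:
        return False
    if sv.length_bp < MIN_SV_LENGTH_BP:
        return False
    return (len(sv.technologies) >= MIN_TECHNOLOGIES
            or len(sv.callsets) >= MIN_CALLSETS)


def is_mendelian_consistent(child: int, father: int, mother: int) -> bool:
    """Genotypes are alternate-allele counts (0, 1 or 2) at a diploid site."""
    # Allele counts each parent can transmit, given that parent's genotype.
    transmittable = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(f + m == child
               for f in transmittable[father]
               for m in transmittable[mother])


def mendelian_error_rate(trio_genotypes) -> float:
    """Fraction of (child, father, mother) genotype triples that are inconsistent."""
    trio_genotypes = list(trio_genotypes)
    if not trio_genotypes:
        return 0.0
    errors = sum(not is_mendelian_consistent(c, f, m)
                 for c, f, m in trio_genotypes)
    return errors / len(trio_genotypes)
```

For example, a 120 bp deletion reported by one short-read and one long-read callset passes the filter, while a 300 bp insertion seen in four callsets from a single technology does not.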

Sergey Koren | Adam M. Phillippy | Iman Hajirasouliha | Marc L. Salit | Shaun D. Jackman | Ali Bashir | Shilpa Garg | Michael C. Schatz | Tobias Marschall | Christopher E. Mason | Paul C. Boutros | George M. Church | Noah Spies | Noah Alexander | Can Alkan | Xian Fan | Jeremiah Wala | Stephen T. Sherry | Aaron M. Wenger | Adam C. English | John S. Oliver | Fritz J. Sedlazeck | Noushin Ghaffari | Nathan D. Olson | Justin M. Zook | Jeffrey A. Rosenfeld | Ian T. Fiddes | Nancy F. Hansen | James C. Mullikin | Camir Ricketts | Rick Tearle | John J. Farrell | Arda Soylev | Weichen Zhou | Ryan E. Mills | Jay M. Sage | Jennifer R. Davis | Michael D. Kaiser | Anthony P. Catalano | Mark J. P. Chaisson | Ken Chen | Andrew J. Carroll | Joyce Lee | Chunlin Xiao | Vincent Huang | Lesley M. Chapman | Sayed Mohammad Ebrahim Sahraeian | Alexandre Rouette | Alvaro Martinez Barrio | Oscar L. Rodriguez
