Methods developed during the first National Center for Biotechnology Information Structural Variation Codeathon at Baylor College of Medicine

In October 2019, 46 scientists from around the world participated in the first National Center for Biotechnology Information (NCBI) Structural Variation (SV) Codeathon at Baylor College of Medicine. The charge of this first annual working session was to identify ongoing challenges around the topics of SV and graph genomes, and to design reliable methods that facilitate their study. Over three days, seven working groups each designed and developed new open-source methods to improve the bioinformatic analysis of genomic SVs represented in next-generation sequencing (NGS) data. The groups’ approaches addressed a wide range of problems in SV detection and analysis, including quality control (QC) assessments of metagenome assemblies and population-scale VCF files, de novo copy number variation (CNV) detection based on continuous long sequence reads, the representation of sequence variation using graph genomes, and the development of an SV annotation pipeline. The questions and developments that arose during the daily discussions between groups are also summarized. The new methods are publicly available at https://github.com/NCBI-Codeathons/ and demonstrate that a codeathon devoted to SV analysis can produce valuable new insights both for participants and for the broader research community.
