sv-callers: a highly portable parallel workflow for structural variant detection in whole-genome sequence data

Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases, including cancer. Despite advances in whole-genome sequencing, comprehensive and accurate detection of SVs in short-read data still poses practical and computational challenges. We present sv-callers, a highly portable workflow that enables parallel execution of multiple SV detection tools and provides users with example analyses of detected SV callsets in a Jupyter Notebook. The workflow supports easy deployment of software dependencies, configuration, and addition of new analysis tools. Moreover, porting it to different computing systems requires minimal effort. Finally, we demonstrate the utility of the workflow by performing both somatic and germline SV analyses on different high-performance computing systems.
