Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection

Background: The phenotypes of cancer cells are driven in part by somatic structural variants. Structural variants can initiate tumors, enhance their aggressiveness, and provide unique therapeutic opportunities. Whole-genome sequencing of tumors can allow exhaustive identification of the specific structural variants present in an individual cancer, facilitating both clinical diagnostics and the discovery of novel mutagenic mechanisms. A plethora of somatic structural variant detection algorithms have been created to enable these discoveries; however, there are no systematic benchmarks of them. Rigorous performance evaluation of somatic structural variant detection methods has been challenged by the lack of gold standards, extensive resource requirements, and difficulties arising from the need to share personal genomic information.

Results: To facilitate structural variant detection algorithm evaluations, we create a robust simulation framework for somatic structural variants by extending the BAMSurgeon algorithm. We then organize and enable a crowdsourced benchmarking within the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (SMC-DNA). We report here the results of structural variant benchmarking on three different tumors, comprising 204 submissions from 15 teams. In addition to ranking methods, we identify characteristic error profiles of individual algorithms and general trends across them. Surprisingly, we find that ensembles of analysis pipelines do not always outperform the best individual method, indicating a need for new ways to aggregate somatic structural variant detection approaches.

Conclusions: The synthetic tumors and somatic structural variant detection leaderboards remain available as a community benchmarking resource, and BAMSurgeon is available at https://github.com/adamewing/bamsurgeon.
