VarBen: Generating in silico Reference Datasets for Clinical Next-Generation Sequencing Bioinformatics Pipeline Evaluation.

Next-generation sequencing (NGS) is increasingly used to detect somatic variants in clinical oncology. However, the bioinformatics pipelines currently available for detecting variants from raw sequencing data still struggle to reach a satisfactory level of robustness and standardization in clinical practice. Moreover, appropriate reference datasets are lacking for clinical bioinformatics pipeline development, validation and proficiency testing. Here, we developed VarBen, an open-source variant simulation tool that generates customized reference datasets by directly editing the original sequencing reads. VarBen can introduce a variety of variants, including single-nucleotide variants, small insertions and deletions, and large structural variants, into targeted, exome or whole-genome sequencing data, and can handle sequencing data from both the Illumina and Ion Torrent sequencing platforms. To demonstrate the feasibility and robustness of VarBen, we performed variant simulation on different sequencing datasets and compared the simulated variants with real-world data. The validation study showed that the simulated data are highly comparable to real-world data and that VarBen is a reliable tool for variant simulation. In addition, our collaborative study of somatic variant calling in 20 laboratories emphasizes the need for laboratories to evaluate their bioinformatics pipelines with customized reference datasets. We expect that VarBen will help users develop and validate their bioinformatics pipelines using locally generated sequencing data.
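The idea of simulating a variant by editing existing reads can be illustrated with a minimal sketch. This is not VarBen's implementation or interface: the `Read` class and the `spike_snv` function are hypothetical names introduced here, and the sketch assumes ungapped alignments (no indels or soft clipping), ignores base qualities and read pairing, and skips the realignment a real tool would likely need. It only shows the core step of writing an alternate base into a chosen fraction of the reads covering a position.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified aligned read (assumption: ungapped alignment)."""
    name: str
    start: int  # 0-based reference position of the first base
    seq: str


def spike_snv(reads, pos, alt, vaf, seed=0):
    """Return a new list of reads in which about `vaf` of the reads covering
    reference position `pos` carry the base `alt` at that position."""
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    # Indices of reads whose aligned span includes the target position.
    covering = [i for i, r in enumerate(reads)
                if r.start <= pos < r.start + len(r.seq)]
    # Number of reads to edit for the requested variant allele frequency.
    n_edit = round(len(covering) * vaf)
    chosen = set(rng.sample(covering, n_edit))
    edited = []
    for i, r in enumerate(reads):
        if i in chosen:
            offset = pos - r.start  # position of the target base within the read
            new_seq = r.seq[:offset] + alt + r.seq[offset + 1:]
            r = Read(r.name, r.start, new_seq)
        edited.append(r)
    return edited
```

Because only the selected reads are modified, the remaining reads keep the sequencing errors and coverage profile of the original data, which is the general appeal of read-editing simulators over fully synthetic read generation.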
