SomatoSim: precision simulation of somatic single nucleotide variants

Background: Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The need to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Because there is no gold standard dataset for benchmarking, it has become common practice to assess the performance of these tools on simulated data. However, many existing somatic variant simulation tools are limited, either because they generate entirely synthetic reads derived from a reference genome or because they do not offer the precise customizability needed for a focused understanding of single nucleotide variant calling performance.

Results: SomatoSim is a tool that lets users simulate somatic single nucleotide variants in Sequence Alignment/Map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation; variant simulation, where reads are selected and mutated; and variant evaluation, where SomatoSim summarizes the simulation results.

Conclusions: SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.
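To make the three-stage process concrete, the following is a minimal sketch of spiking a single variant into existing reads at a target variant allele fraction. It is not SomatoSim's implementation or API: the function name `simulate_snv`, the toy read representation (dicts with `start`, `seq`, and `qual`), and the `min_base_quality` threshold are all assumptions made for illustration, and real SAM/BAM handling (CIGAR strings, mapping quality, strand) is left out.

```python
import random


def simulate_snv(reads, pos, alt_base, target_vaf, min_base_quality=20, seed=0):
    """Spike one SNV into toy reads (hypothetical helper, not SomatoSim's API).

    reads: list of dicts with 'start' (0-based int), 'seq' (str), 'qual' (list of int).
    Reads are modified in place; a summary dict is returned.
    """
    rng = random.Random(seed)

    # Stage 1 (selection): find the reads that cover the candidate position.
    covering = [r for r in reads if r["start"] <= pos < r["start"] + len(r["seq"])]
    depth = len(covering)

    # Only reads with adequate base quality that do not already carry the
    # alternate allele are eligible for mutation (an assumed filter).
    eligible = [
        r for r in covering
        if r["qual"][pos - r["start"]] >= min_base_quality
        and r["seq"][pos - r["start"]] != alt_base
    ]

    # Stage 2 (simulation): mutate enough reads to approach the target VAF.
    n_to_mutate = min(round(target_vaf * depth), len(eligible))
    for r in rng.sample(eligible, n_to_mutate):
        offset = pos - r["start"]
        r["seq"] = r["seq"][:offset] + alt_base + r["seq"][offset + 1:]

    # Stage 3 (evaluation): report the depth and the VAF actually achieved.
    alt_count = sum(1 for r in covering if r["seq"][pos - r["start"]] == alt_base)
    vaf = alt_count / depth if depth else 0.0
    return {"depth": depth, "alt_count": alt_count, "vaf": vaf}


# Toy input: 20 identical reads covering positions 0-9, all with base quality 30.
toy_reads = [{"start": 0, "seq": "ACGTACGTAC", "qual": [30] * 10} for _ in range(20)]
summary = simulate_snv(toy_reads, pos=4, alt_base="T", target_vaf=0.10)
print(summary)
```

Because existing reads are edited rather than synthesized from a reference, the depth of coverage and the qualities of untouched bases are preserved in this sketch, which is the property the abstract contrasts with fully synthetic read generators.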
