A broad survey of DNA sequence data simulation tools.

In silico DNA sequence generation is a powerful approach for evaluating and validating bioinformatics tools, and accordingly more than 35 DNA sequence simulation tools have been developed. With such a diverse array of tools to choose from, an important question is: which tool should be used for a desired outcome? This question is largely unanswered because documentation for many of these tools is sparse. To address this, we reviewed the DNA sequence simulation tools developed to date and evaluated 20 state-of-the-art tools on their ability to produce accurate reads based on their implemented sequence error model. We provide a succinct description of each tool and suggest which tool is most appropriate for different scenarios. Given the multitude of similar yet non-identical tools, researchers can use this review as a guide to inform their choice of DNA sequence simulation tool. This paves the way towards assessing existing tools in a unified framework, as well as enabling the analysis of different simulation scenarios within the same framework.
