GSDcreator: An Efficient and Comprehensive Simulator for Generating NGS Data with Population Genetic Information

In recent decades, the analysis of next-generation sequencing (NGS) data has become a major research field in bioinformatics and offers clear advantages in many application scenarios. Many algorithms and software tools have been designed for analyzing NGS data, and simulated datasets are urgently needed for testing these tools and optimizing their parameter configurations. A series of NGS data simulators has therefore been published. However, the existing simulators cannot satisfy the requirements of many specific scenarios. First, they do not support many newly discovered variation types. Second, complex structural variations are difficult to generate. Third, as population data accumulate, there is an urgent need to simulate population information. In this paper, we propose GSDcreator, a comprehensive NGS simulator that overcomes the three weaknesses mentioned above. It can produce all known types of variation, including complex variations. Furthermore, it can capture many important features of real data, including population polymorphism, insert size distribution, adjacent site depth distribution, overall depth distribution, quality score distribution, amplification bias, and sequencing errors. Notably, GSDcreator takes the 1000 Genomes Project database as a reference and integrates population genetic information to simulate population polymorphism. To test its performance, we conducted a series of experiments and found that the simulated data produced by GSDcreator closely mimic real sequencing data.
