FIGG: Simulating populations of whole genome sequences for heterogeneous data analyses

Background: High-throughput sequencing has become one of the primary tools for investigating the molecular basis of disease. The increasing use of sequencing in studies of both individuals and populations is challenging our ability to develop analysis tools that scale with the data. This is of particular concern in studies that exhibit a high degree of heterogeneity or deviation from the standard reference genome. Population-scale sequencing studies require analysis tools that are developed and tested against matching quantities of heterogeneous data.

Results: We developed a large-scale whole genome simulation tool, FIGG, which generates large numbers of whole genomes with known sequence characteristics based on direct sampling of experimentally known or theorized variations. For normal variations, we used publicly available data to determine the frequency of different mutation classes across the genome. FIGG then uses this information as a background to generate new sequences from a parent sequence, with matching frequencies but different actual mutations. The background can be normal variations, known disease variations, or a theoretical frequency distribution of variations.

Conclusion: To enable the creation of large numbers of genomes, FIGG generates simulated sequences from known genomic variation and iteratively mutates each genome separately. The result is multiple whole genome sequences with unique variations, which can be used to provide different reference genomes, to model heterogeneous populations, and to offer a standard test environment for new analysis algorithms or bioinformatics tools.
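The sampling idea described in the results can be illustrated with a minimal Python sketch. This is not FIGG's implementation: the mutation classes, the per-base frequencies in `BACKGROUND`, and the function names `mutate` and `simulate_population` are all hypothetical, chosen only to show how a background frequency table can drive the generation of several child sequences from one parent, each with matching frequencies but different actual mutations.

```python
import random

# Hypothetical per-base frequencies for each mutation class.
# Illustrative values only; not taken from FIGG or any real dataset.
BACKGROUND = {"snv": 0.01, "insertion": 0.002, "deletion": 0.002}


def mutate(parent, background, rng):
    """Return (child_sequence, applied_mutations) for one simulated genome."""
    classes = list(background)
    weights = [background[c] for c in classes]
    p_any = sum(weights)  # chance that a given base is mutated at all
    child = []
    applied = []
    for pos, base in enumerate(parent):
        if rng.random() >= p_any:
            child.append(base)  # base is copied unchanged
            continue
        # Pick a mutation class in proportion to its background frequency.
        kind = rng.choices(classes, weights=weights)[0]
        if kind == "snv":
            child.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "insertion":
            child.append(base + rng.choice("ACGT"))
        # For a deletion, nothing is appended: the base is dropped.
        applied.append((pos, kind))
    return "".join(child), applied


def simulate_population(parent, background, n, seed=0):
    """Mutate the same parent n times, each child drawn independently."""
    rng = random.Random(seed)
    return [mutate(parent, background, rng) for _ in range(n)]


if __name__ == "__main__":
    parent = "ACGT" * 250
    for child, muts in simulate_population(parent, BACKGROUND, 3, seed=1):
        print(len(child), len(muts))
```

Because every child records the mutations applied to it, each simulated sequence has known characteristics that an analysis tool's output can be checked against. Swapping `BACKGROUND` for a different table corresponds to changing the background from normal variation to, say, a disease or theoretical distribution.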
