Genome simulation approaches for synthesizing in silico datasets for human genomics.

Simulating data is a necessary first step in evaluating new analytic methods because the true effects in simulated data are known. To successfully develop novel statistical and computational methods for genetic analysis, it is vital to simulate datasets consisting of single nucleotide polymorphisms (SNPs) spread throughout the genome at a density similar to that observed in new high-throughput molecular genomics studies. In addition, the simulation of environmental data and effects will be essential to properly formulate risk models for complex disorders. Data simulations are often criticized for being much less noisy than natural biological data, since it is nearly impossible to simulate the multitude of possible sources of natural and experimental variability. However, simulating data in silico is the most straightforward way to test the true potential of new methods during development. Thus, advances that increase the complexity of data simulations will permit investigators to better assess new analytical methods. In this work, we briefly describe some of the current approaches to simulating human genomics data, along with the advantages and disadvantages of each. We also give details on the software packages available for data simulation. Finally, we expand upon one particular approach to creating complex human genomic datasets that uses a forward-time population simulation algorithm: genomeSIMLA. Many of the hallmark features of biological datasets can be synthesized in silico; still, much research is needed to enhance our ability to create datasets that capture the natural complexity of biological data.
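
To make the forward-time idea concrete, the following is a minimal Python sketch of a generic forward-time simulation followed by disease-status assignment with a known causal SNP. It is not the genomeSIMLA algorithm. The function names, the random-mating scheme, the single-crossover recombination, and the penetrance values are all assumptions chosen for illustration.

```python
import random


def simulate_forward(n_individuals=200, n_snps=50, n_generations=20,
                     crossover_prob=0.5, seed=1):
    """Evolve a diploid population forward in time under random mating.

    Illustrative assumptions: constant population size, at most one
    crossover per gamete, no mutation or selection, selfing allowed.
    """
    rng = random.Random(seed)
    # Founders: each individual carries two haplotypes of independent 0/1 alleles.
    pop = [([rng.randint(0, 1) for _ in range(n_snps)],
            [rng.randint(0, 1) for _ in range(n_snps)])
           for _ in range(n_individuals)]

    def gamete(parent):
        # Pick which parental haplotype to start from, then optionally
        # switch to the other one at a single random crossover point.
        first, second = parent if rng.random() < 0.5 else parent[::-1]
        if rng.random() < crossover_prob:
            cut = rng.randint(1, n_snps - 1)
            return first[:cut] + second[cut:]
        return list(first)

    for _ in range(n_generations):
        # Each offspring receives one gamete from each of two randomly drawn parents.
        pop = [(gamete(rng.choice(pop)), gamete(rng.choice(pop)))
               for _ in range(n_individuals)]
    return pop


def assign_status(pop, causal_snp=10, penetrance=(0.05, 0.15, 0.40), seed=2):
    """Assign case (1) or control (0) status from a known causal SNP.

    penetrance[g] is the assumed probability of disease for an individual
    carrying g copies of allele 1 at the causal SNP (hypothetical values).
    """
    rng = random.Random(seed)
    status = []
    for hap_a, hap_b in pop:
        genotype = hap_a[causal_snp] + hap_b[causal_snp]
        status.append(1 if rng.random() < penetrance[genotype] else 0)
    return status


population = simulate_forward()
status = assign_status(population)
print(len(population), "individuals,", sum(status), "cases")
```

Because the causal SNP and its penetrances are set by the investigator, any analysis method applied to the resulting genotypes and statuses can be scored against a known truth, which is the property the abstract emphasizes.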
