Forward-time simulation of realistic samples for genome-wide association studies

Background: Forward-time simulations have unique advantages in power and flexibility for simulating genetic samples of complex human diseases, because they can closely mimic the evolution of the human populations that carry these diseases. However, a number of methodological and computational constraints have prevented existing forward-time simulation methods from fully exploiting this power.

Results: Using a general-purpose forward-time population genetics simulation environment, we developed a forward-time simulation method that can be used to simulate realistic samples for genome-wide association studies. We examined the properties of this simulation method by comparing simulated samples with real data, and demonstrated its wide applicability using four examples, including a simulation of case-control samples with a disease caused by multiple interacting genetic and environmental factors, a simulation of trio families affected by a disease-predisposing allele that had been subjected to either a slow or a rapid selective sweep, and a simulation of a structured population resulting from recent population admixture.

Conclusions: Our algorithm simulates populations that closely resemble the complex structure of the human genome, while allowing the introduction of signals of natural selection. Because of its flexibility in generating different types of samples with arbitrary disease or quantitative trait models, this simulation method can produce realistic samples for evaluating the performance of a wide variety of statistical gene mapping methods for genome-wide association studies.
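The idea behind forward-time simulation can be illustrated with a minimal sketch. The code below is not the method described in this paper and does not use its simulation environment; it is a toy single-locus Wright-Fisher model written in plain Python, and all names (`evolve`, `allele_freq`, `sample_case_control`) and parameter choices are assumptions made for illustration. It evolves a diploid population generation by generation under selection, then draws a case-control sample using a penetrance table.

```python
import random


def evolve(pop_size, generations, init_freq, fitness, rng):
    """Evolve one diallelic locus forward in time (toy Wright-Fisher model).

    Each individual is a pair of alleles (0 or 1). `fitness` maps the number
    of copies of allele 1 (0, 1 or 2) to a positive relative fitness.
    """
    pop = [(int(rng.random() < init_freq), int(rng.random() < init_freq))
           for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness[a + b] for a, b in pop]
        # Parents are chosen in proportion to fitness (natural selection).
        mothers = rng.choices(pop, weights=weights, k=pop_size)
        fathers = rng.choices(pop, weights=weights, k=pop_size)
        # Each parent transmits one of its two alleles at random.
        pop = [(rng.choice(m), rng.choice(f))
               for m, f in zip(mothers, fathers)]
    return pop


def allele_freq(pop):
    """Frequency of allele 1 in the population."""
    return sum(a + b for a, b in pop) / (2 * len(pop))


def sample_case_control(pop, penetrance, n_cases, n_controls, rng):
    """Assign affection status by genotype penetrance, then sample."""
    cases, controls = [], []
    for ind in pop:
        affected = rng.random() < penetrance[sum(ind)]
        (cases if affected else controls).append(ind)
    return (rng.sample(cases, min(n_cases, len(cases))),
            rng.sample(controls, min(n_controls, len(controls))))


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical parameters: allele 1 is mildly advantageous.
    pop = evolve(1000, 50, 0.1, {0: 1.0, 1: 1.02, 2: 1.04}, rng)
    cases, controls = sample_case_control(
        pop, {0: 0.05, 1: 0.10, 2: 0.20}, 100, 100, rng)
    print("population frequency:", allele_freq(pop))
    print("cases:", len(cases), "controls:", len(controls))
```

A realistic genome-wide simulation would additionally need many linked loci, recombination, mutation, and demographic change, none of which this sketch attempts.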
