Fast and flexible simulation of DNA sequence data.

Simulation of genomic sequences under the coalescent with recombination has conventionally been impractical for regions beyond tens of megabases. This work presents an algorithm, implemented as the program MaCS (Markovian Coalescent Simulator), that can efficiently simulate haplotypes under an arbitrary model of population history. We present several metrics comparing the performance of MaCS with that of other available simulation programs. Practical usage of MaCS is demonstrated by comparing measures of linkage disequilibrium in simulated output with those in real genotype data from populations considered to be structured.
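
As an illustration of what one such linkage disequilibrium measure looks like, the sketch below computes the standard r^2 statistic between two biallelic sites from phased haplotypes. It is a minimal example and not part of MaCS: the function name `r_squared` and the 0/1 allele encoding are assumptions made here, and the abstract does not state which LD measures were used.

```python
from typing import Sequence


def r_squared(site_a: Sequence[int], site_b: Sequence[int]) -> float:
    """Return r^2 between two biallelic sites.

    Each argument lists the allele (0 or 1) carried by each haplotype,
    in the same haplotype order. This encoding is an assumption of the
    sketch, not a MaCS output format.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")

    n = len(site_a)
    p_a = sum(site_a) / n  # frequency of allele 1 at the first site
    p_b = sum(site_b) / n  # frequency of allele 1 at the second site
    # frequency of haplotypes carrying allele 1 at both sites
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n

    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Two perfectly correlated sites across four haplotypes.
    print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
    # Two sites whose alleles are statistically independent.
    print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

Applying the same function to pairs of sites in simulated and in real haplotypes, then summarizing r^2 as a function of physical distance, is one way such a comparison could be set up.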
