Simulated data for a complex genetic trait (Problem 2 for GAW11): How the model was developed, and why

This paper describes a simulated data set created as Problem 2 for GAW11. The generating model for Problem 2 involved two different genetic diseases, or “types,” in three separate populations. The two‐locus (2L) type results from the epistatic interaction of two genetic loci, and the three‐allele type results from a single locus with two disease‐causing alleles and one normal allele. Each type has two phenotypic forms, Mild and Severe, and both forms are subject to genetic and environmental influences. The disease occurs in three different hypothetical populations, each with different disease allele frequencies and penetrances. In two of the populations there is also a fourth locus with an allele that is associated with the 2L type. Misdiagnosis can occur, but only after a family has already been ascertained through ≥ 2 “genetically” affected offspring. Finally, the three populations are studied by four different hypothetical research groups. Each group has its own ideas about how the disease is inherited and has therefore devised its own ascertainment scheme based on those beliefs. Each research group collected 100‐family data sets, including data on 300 markers on six chromosomes and measurements of disease status and of the two proposed environmental factors. GAW participants were supplied with 25 random replicates of each data set.