Two-Stage Designs in Case–Control Association Analysis

DNA pooling is a cost-effective approach for collecting information on marker allele frequencies in genetic studies. It is often suggested as a screening tool to identify a subset of candidate markers, from a very large number of markers, to be followed up by more accurate and informative individual genotyping. In this article, we investigate several statistical properties and design issues related to this two-stage design, including the selection of candidate markers for second-stage analysis, the statistical power of the design, and the probability that truly disease-associated markers are ranked among the top after second-stage analysis. We have derived analytical results on the proportion of markers to be selected for second-stage analysis. For example, to detect disease-associated markers with an allele frequency difference of 0.05 between the cases and controls through an initial sample of 1000 cases and 1000 controls, our results suggest that when the measurement errors are small (0.005), approximately 3% of the markers should be selected. For the statistical power to identify disease-associated markers, we find that the measurement errors associated with DNA pooling have little effect on the power of the two-stage design. This is in contrast to the one-stage pooling scheme, where measurement errors may have a large effect on statistical power. As for the probability that the disease-associated markers are ranked among the top in the second stage, we show that, for reasonably large sample sizes, there is a high probability that at least one disease-associated marker is ranked among the top when the allele frequency differences between the cases and controls are at least 0.05, even when the errors associated with DNA pooling in the first stage are not small. Therefore, the two-stage design with DNA pooling as a screening tool offers an efficient strategy in genomewide association studies, even when the measurement errors associated with DNA pooling are nonnegligible.
For any disease model, we find that all the statistical results essentially depend on the population allele frequency and the allele frequency differences between the cases and controls at the disease-associated markers. The general conclusions hold whether the second stage uses an entirely independent sample or includes both the samples used in the first stage and an independent set of samples.
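
To make the first-stage selection step concrete, the following is a minimal sketch, not the derivation used in the article, of how the chance that a truly associated marker survives the pooled screen could be approximated. It assumes a normal approximation for the pooled allele frequency estimates, an additive measurement error with standard deviation `error_sd` applied independently to each pool, and a two-sided cutoff chosen so that a fraction `fraction_selected` of null markers passes. The function name, parameter names, and the control allele frequency of 0.3 in the usage line are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def first_stage_retention(p_control, delta, n_cases, n_controls,
                          error_sd, fraction_selected):
    """Approximate probability that an associated marker passes the pooled screen.

    Assumptions (illustrative, not taken from the article):
      * pooled frequency estimate = true sample frequency + N(0, error_sd^2) noise
      * each individual contributes 2 chromosomes, so sampling variance is p(1-p)/(2n)
      * null markers have frequency p_control in both cases and controls
      * markers are selected when |case estimate - control estimate| exceeds a cutoff
        that lets a fraction `fraction_selected` of null markers through
    """
    if not 0.0 < fraction_selected < 1.0:
        raise ValueError("fraction_selected must lie strictly between 0 and 1")

    std_normal = NormalDist()
    p_case = p_control + delta

    # Standard deviation of the estimated frequency difference for a null marker.
    null_var = (p_control * (1 - p_control) / (2 * n_cases)
                + p_control * (1 - p_control) / (2 * n_controls)
                + 2 * error_sd ** 2)
    null_sd = sqrt(null_var)

    # Standard deviation of the difference for a truly associated marker.
    alt_var = (p_case * (1 - p_case) / (2 * n_cases)
               + p_control * (1 - p_control) / (2 * n_controls)
               + 2 * error_sd ** 2)
    alt_sd = sqrt(alt_var)

    # Two-sided cutoff on the observed difference.
    cutoff = std_normal.inv_cdf(1 - fraction_selected / 2) * null_sd

    # P(|D| > cutoff) when D ~ N(delta, alt_sd^2).
    upper_tail = 1 - std_normal.cdf((cutoff - delta) / alt_sd)
    lower_tail = std_normal.cdf((-cutoff - delta) / alt_sd)
    return upper_tail + lower_tail


if __name__ == "__main__":
    # Scenario from the abstract, with an assumed control allele frequency of 0.3.
    print(first_stage_retention(0.3, 0.05, 1000, 1000, 0.005, 0.03))
```

Under these assumptions, a marker with no true frequency difference is retained with probability equal to `fraction_selected`, and retention of an associated marker rises as more markers are carried forward and falls as the pooling error grows.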
