A two-stage design for multiple testing in large-scale association studies

Abstract

Modern association studies often involve a large number of markers and therefore face the problem of testing multiple hypotheses. Traditional procedures are usually over-conservative and have low power to detect mild genetic effects. From a design perspective, we propose a two-stage selection procedure to address this concern. The main principle is to reduce the total number of tests by removing clearly unassociated markers in a first-stage test that uses a less stringent nominal level. Conditional on the first-stage findings, a more conservative test is then conducted in the second stage using the augmented data together with the data from the first stage. Previous studies have suggested using independent samples to avoid inflated errors. However, we found that, after accounting for the dependence between these two samples, the true discovery rate increases substantially. In addition, this approach can greatly reduce the cost of genotyping. Results from a study of hypertriglyceridemia and from simulations suggest that the two-stage method has a higher overall true positive rate (TPR) with a controlled overall false positive rate (FPR) than single-stage approaches. We also report the analytical form of the overall FPR, which may be useful in guiding study design to achieve a high TPR while retaining the desired FPR.
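The two-stage idea can be illustrated with a small simulation. The sketch below is not the authors' procedure or their analytical FPR formula, neither of which is given in the abstract. It assumes a hypothetical setup in which each marker yields a unit-variance z-statistic, stage 1 screens on n1 samples at a lenient level alpha1, and stage 2 tests the surviving markers at a stricter level alpha2 using a z-statistic pooled over the n1 first-stage and n2 augmented samples. The function name, parameters, and effect-size model are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def two_stage_simulation(m=2000, m_assoc=50, effect=0.25, n1=100, n2=200,
                         alpha1=0.10, alpha2=0.001, seed=0):
    """Simulate a two-stage screen over m markers (hypothetical model).

    The first m_assoc markers carry a standardized effect `effect`;
    the remaining markers are null. All statistics are two-sided z-tests.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    z1_cut = norm.inv_cdf(1 - alpha1 / 2)  # lenient stage-1 threshold
    z2_cut = norm.inv_cdf(1 - alpha2 / 2)  # stricter stage-2 threshold
    n = n1 + n2
    passed = tp = fp = 0
    for j in range(m):
        mu = effect if j < m_assoc else 0.0
        # Stage 1: z-statistic from the first n1 samples.
        z1 = rng.gauss(mu * math.sqrt(n1), 1.0)
        if abs(z1) < z1_cut:
            continue  # marker dropped, never genotyped in stage 2
        passed += 1
        # Stage 2: new data on n2 samples, pooled with the stage-1 data.
        z2_new = rng.gauss(mu * math.sqrt(n2), 1.0)
        z_pooled = (math.sqrt(n1) * z1 + math.sqrt(n2) * z2_new) / math.sqrt(n)
        if abs(z_pooled) >= z2_cut:
            if j < m_assoc:
                tp += 1
            else:
                fp += 1
    return {
        "stage2_tests": passed,
        "tpr": tp / m_assoc,
        "fpr": fp / (m - m_assoc),
        # Genotype counts: every marker in stage 1, only survivors in stage 2.
        "genotypes_two_stage": m * n1 + passed * n2,
        "genotypes_single_stage": m * n,
    }


if __name__ == "__main__":
    for key, value in two_stage_simulation().items():
        print(f"{key}: {value}")
```

Because the pooled statistic reuses the stage-1 data, the two test statistics are dependent, which is the situation the abstract refers to. In this simplified sketch a null marker is declared positive only if it passes both thresholds, so its overall false positive probability cannot exceed the smaller of alpha1 and alpha2. The genotyping saving comes from carrying only the stage-1 survivors into the second sample.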
