Power estimation and sample size determination for replication studies of genome-wide association studies

BackgroundReplication study is a commonly used verification method to filter out false positives in genome-wide association studies (GWAS). If an association can be confirmed in a replication study, it will have a high confidence to be true positive. To design a replication study, traditional approaches calculate power by treating replication study as another independent primary study. These approaches do not use the information given by primary study. Besides, they need to specify a minimum detectable effect size, which may be subjective. One may think to replace the minimum effect size with the observed effect sizes in the power calculation. However, this approach will make the designed replication study underpowered since we are only interested in the positive associations from the primary study and the problem of the “winner’s curse” will occur.ResultsAn Empirical Bayes (EB) based method is proposed to estimate the power of replication study for each association. The corresponding credible interval is estimated in the proposed approach. Simulation experiments show that our method is better than other plug-in based estimators in terms of overcoming the winner’s curse and providing higher estimation accuracy. The coverage probability of given credible interval is well-calibrated in the simulation experiments. Weighted average method is used to estimate the average power of all underlying true associations. This is used to determine the sample size of replication study. Sample sizes are estimated on 6 diseases from Wellcome Trust Case Control Consortium (WTCCC) using our method. They are higher than sample sizes estimated by plugging observed effect sizes in power calculation.ConclusionsOur new method can objectively determine replication study’s sample size by using information extracted from primary study. Also the winner’s curse is alleviated. Thus, it is a better choice when designing replication studies of GWAS. The R-package is available at: http://bioinformatics.ust.hk/RPower.html.

[1]  Thomas W. Mühleisen,et al.  Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease , 2011, Nature Genetics.

[2]  Fei Zou,et al.  Estimating odds ratios in genome scans: an approximate conditional likelihood approach. , 2008, American journal of human genetics.

[3]  J Blangero,et al.  Large upward bias in estimation of locus-specific effects from genomewide scans. , 2001, American journal of human genetics.

[4]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[5]  B. Lindqvist,et al.  Estimating the proportion of true null hypotheses, with application to DNA microarray data , 2005 .

[6]  Tanya M. Teslovich,et al.  Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes , 2012, Nature Genetics.

[7]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[8]  T. Cai,et al.  Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons , 2006, math/0611108.

[9]  B. Woolf ON ESTIMATING THE RELATION BETWEEN BLOOD GROUP AND DISEASE , 1955, Annals of human genetics.

[10]  Hongyu Zhao,et al.  Empirical Bayes Correction for the Winner's Curse in Genetic Association Studies , 2013, Genetic epidemiology.

[11]  Peter Kraft,et al.  Replication in genome-wide association studies. , 2009, Statistical science : a review journal of the Institute of Mathematical Statistics.

[12]  R. Prentice,et al.  Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. , 2008, Biostatistics.

[13]  P. Donnelly,et al.  Replicating genotype–phenotype associations , 2007, Nature.

[14]  Nilanjan Chatterjee,et al.  Estimation of effect size distribution from genome-wide association studies and implications for future discoveries , 2010, Nature Genetics.

[15]  Simon C. Potter,et al.  Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls , 2007, Nature.

[16]  J. Ioannidis Why Most Discovered True Associations Are Inflated , 2008, Epidemiology.

[17]  D. Balding A tutorial on statistical methods for population association studies , 2006, Nature Reviews Genetics.

[18]  N. Mehta Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. , 2011, Circulation. Cardiovascular genetics.

[19]  Radu V. Craiu,et al.  Bayesian methods to overcome the winner’s curse in genetic studies , 2009, 0907.2770.

[20]  J. Pritchard,et al.  Overcoming the winner's curse: estimating penetrance parameters from case-control data. , 2007, American journal of human genetics.

[21]  Bradley Efron,et al.  Local False Discovery Rates , 2005 .

[22]  P. Visscher,et al.  Common SNPs explain a large proportion of heritability for human height , 2011 .

[23]  V. Bansal,et al.  Statistical analysis strategies for association studies involving rare variants , 2010, Nature Reviews Genetics.

[24]  Shelley B. Bull,et al.  BR-squared: a practical solution to the winner’s curse in genome-wide scans , 2011, Human Genetics.