Controlling the Rate of GWAS False Discoveries

As both the number and the complexity of traits of interest have grown, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. A number of robust FDR-controlling strategies exist, but this error rate is intimately tied to the precise way in which discoveries are counted: FDR-controlling procedures perform satisfactorily only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the rejected hypotheses. In genome-wide association studies (GWAS), linkage disequilibrium between markers often leads researchers to treat the signal associated with multiple neighboring SNPs as evidence for a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses inflates the relevant FDR. We propose a novel approach to FDR control based on prescreening to identify the level of resolution of distinct hypotheses. Using both theoretical results and simulations that mimic the dependence structure expected in GWAS, we show how FDR-controlling strategies can be adapted to account for this initial selection. We demonstrate that our approach is versatile and useful whether the data are analyzed with single-marker tests or with multiple regression. We provide an R package that allows practitioners to apply our procedure to data in standard GWAS format, and we illustrate its performance on lipid traits in the North Finland Birth Cohort 66 study.
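
The inflation caused by a posteriori aggregation can be seen in a small toy calculation. The sketch below is not the prescreening procedure proposed here; it only applies the standard Benjamini-Hochberg step-up rule to SNP-level p-values and then counts discoveries twice, once per SNP and once per locus. All p-values, locus labels, and the level q = 0.1 are invented for illustration.

```python
def benjamini_hochberg(pvals, q):
    """Return sorted indices rejected by the BH step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose ordered p-value falls under the BH line
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical data: 9 SNPs in LD tagging one true locus "A",
# one null SNP at locus "B" with a smallish p-value, and 10 other null SNPs.
pvals = [1e-9, 2e-9, 5e-9, 1e-8, 2e-8, 5e-8, 1e-7, 2e-7, 5e-7,  # locus A
         0.004,                                                  # locus B
         0.12, 0.20, 0.31, 0.42, 0.50, 0.61, 0.70, 0.78, 0.86, 0.95]
locus = ["A"] * 9 + ["B"] + [f"N{i}" for i in range(1, 11)]
is_signal = [True] * 9 + [False] * 11

rejected = benjamini_hochberg(pvals, q=0.1)

# False discovery proportion when each rejected SNP counts as a discovery.
snp_false = sum(1 for i in rejected if not is_signal[i])
snp_fdp = snp_false / len(rejected)

# False discovery proportion when neighboring rejections are merged into loci.
true_loci = {locus[i] for i in range(len(pvals)) if is_signal[i]}
rejected_loci = {locus[i] for i in rejected}
locus_false = len(rejected_loci - true_loci)
locus_fdp = locus_false / len(rejected_loci)

print(f"SNP-level FDP:   {snp_fdp:.2f}")
print(f"locus-level FDP: {locus_fdp:.2f}")
```

In this toy realization the same single false rejection is one of ten SNP-level discoveries but one of two locus-level discoveries, so the proportion of false discoveries among what would be reported as distinct loci is much larger than the SNP-level count suggests.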
