Powerful gene set analysis in GWAS with the Generalized Berk-Jones statistic

A common complementary strategy in Genome-Wide Association Studies (GWAS) is to perform Gene Set Analysis (GSA), which tests for the association between a phenotype of interest and an entire set of Single Nucleotide Polymorphisms (SNPs) residing in selected genes. While there exist many tools for performing GSA, popular methods often include a number of ad-hoc steps that are difficult to justify statistically, provide complicated interpretations based on permutation inference, and demonstrate poor operating characteristics. Additionally, the lack of gold-standard gene set lists can produce misleading results and create difficulties in comparing analyses even for the same phenotype. We introduce the Generalized Berk-Jones (GBJ) statistic for GSA, a permutation-free parametric framework that offers asymptotic power guarantees in certain set-based testing settings. To adjust for confounding introduced by different gene set lists, we further develop a GBJ step-down inference technique that can discriminate between gene sets driven to significance by single genes and those demonstrating group-level effects. We compare GBJ to popular alternatives through simulation and re-analysis of summary statistics from a large breast cancer GWAS, and we show how GBJ can increase power by incorporating information from multiple signals in the same gene. In addition, we illustrate how breast cancer pathway analysis can be confounded by the frequency of FGFR2 in pathway lists. Our approach is further validated on two other datasets of summary statistics generated from GWAS of height and schizophrenia.
