Powerful gene set analysis in GWAS with the Generalized Berk-Jones statistic

A common complementary strategy in Genome-Wide Association Studies (GWAS) is to perform Gene Set Analysis (GSA), which tests for association between a phenotype of interest and an entire set of Single Nucleotide Polymorphisms (SNPs) residing in selected genes. While many tools exist for performing GSA, popular methods often include ad-hoc steps that are difficult to justify statistically, produce results that are complicated to interpret because they rely on permutation inference, and demonstrate poor operating characteristics. Additionally, the lack of gold standard gene set lists can produce misleading results and make it difficult to compare analyses even of the same phenotype. We introduce the Generalized Berk-Jones (GBJ) statistic for GSA, a permutation-free parametric framework that offers asymptotic power guarantees in certain set-based testing settings. To adjust for confounding introduced by different gene set lists, we further develop a GBJ step-down inference technique that can discriminate between gene sets driven to significance by single genes and those demonstrating group-level effects. We compare GBJ to popular alternatives through simulation and re-analysis of summary statistics from a large breast cancer GWAS, and we show how GBJ can increase power by incorporating information from multiple signals in the same gene. In addition, we illustrate how breast cancer pathway analysis can be confounded by the frequency with which FGFR2 appears in pathway lists. Our approach is further validated on two other datasets of summary statistics, generated from GWAS of height and schizophrenia.
