Joint analysis of individual-level and summary-level GWAS data by leveraging pleiotropy

MOTIVATION Many recent genome-wide association studies (GWASs) of complex phenotypes confirm the early conjecture of polygenicity, suggesting the presence of a large number of variants with only tiny or moderate effects. However, because the sample size of a single GWAS is limited, many associated genetic variants have effects too weak to reach genome-wide significance. These undiscovered variants in turn limit the prediction capability of GWAS. Restricted access to individual-level data and the increasing availability of published GWAS results motivate the development of methods that integrate both individual-level and summary-level data. How the connection between the two types of data is modeled determines how efficiently the abundant summary-level resources can be used alongside limited individual-level data, and this issue calls for further work in this area.
RESULTS In this study, we propose a statistical approach, LEP, which provides a novel way of modeling the connection between individual-level and summary-level data. LEP integrates both types of data by LEveraging Pleiotropy to increase the statistical power of risk variant identification and the accuracy of risk prediction. The parameter estimation algorithm is designed to handle genome-wide-scale data. Through comprehensive simulation studies, we demonstrate the advantages of LEP over existing methods. We further applied LEP to an integrative analysis of Crohn's disease data from WTCCC and summary statistics from GWASs of other diseases, such as type 1 diabetes, ulcerative colitis and primary biliary cirrhosis. LEP significantly increased the statistical power of identifying risk variants and improved the risk prediction accuracy from 63.39% (±0.58%) to 68.33% (±0.32%) using about 195 000 variants.
AVAILABILITY AND IMPLEMENTATION The LEP software is available at https://github.com/daviddaigithub/LEP.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
