Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a "Regression with Summary Statistics" (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations, RSS performs similarly to analyses using the individual data, both for estimating heritability and for detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
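
To make the idea of a summary-statistics likelihood concrete, the following is a minimal sketch and not the implementation in the linked software. The abstract does not give the formula, so the sketch assumes the likelihood is multivariate normal, with the vector of univariate estimates distributed as N(S R S^{-1} beta, S R S). Here S is the diagonal matrix of univariate standard errors, R is the estimated SNP correlation matrix, and beta holds the multiple regression coefficients. The function name `rss_loglik` and its argument names are hypothetical.

```python
import numpy as np

def rss_loglik(beta, betahat, se, R):
    """Log-density of the univariate estimates under an assumed RSS-style model.

    Assumed form (not stated in the abstract):
        betahat ~ N(S R S^{-1} beta, S R S),  S = diag(se)
    beta    : multiple-regression coefficients, shape (p,)
    betahat : univariate (single-SNP) effect estimates, shape (p,)
    se      : standard errors of betahat, shape (p,)
    R       : estimated SNP correlation matrix, shape (p, p)
    """
    beta = np.asarray(beta, dtype=float)
    betahat = np.asarray(betahat, dtype=float)
    se = np.asarray(se, dtype=float)
    R = np.asarray(R, dtype=float)

    # Mean S R S^{-1} beta, computed without forming the diagonal matrix S.
    mean = se * (R @ (beta / se))
    # Covariance S R S, i.e. R scaled elementwise by se_i * se_j.
    cov = R * np.outer(se, se)

    resid = betahat - mean
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("S R S must be positive definite")
    quad = resid @ np.linalg.solve(cov, resid)
    p = betahat.size
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + quad)
```

Under this assumed form, setting R to the identity reduces the expression to a product of independent normal densities, one per SNP. A Bayesian analysis would combine a likelihood of this kind with a prior on beta and sample the posterior, for example by Markov chain Monte Carlo.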
