Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors

Polygenic prediction has shown promise in identifying individuals at high risk for complex diseases, and may become clinically useful as the predictive performance of polygenic risk scores (PRS) improves. To date, most applications calculate PRS using a subset of largely independent genetic markers, but this approach discards information and limits the predictive value of PRS. More sophisticated Bayesian genomic prediction methods that jointly model genetic markers across the genome are computationally challenging and do not accurately account for linkage disequilibrium (LD) structure. Here, we present PRS-CS, a novel polygenic prediction method that infers posterior SNP effect sizes using GWAS summary statistics and an external LD reference panel. PRS-CS uses a high-dimensional Bayesian regression framework and differs from previous work in placing a continuous shrinkage (CS) prior on SNP effect sizes. This prior is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of effect size distributions, especially when the training sample size is large. We apply PRS-CS to predict six common, complex diseases and six quantitative traits in the Partners HealthCare Biobank, for which large-scale external GWAS summary statistics are publicly available, and further demonstrate that PRS-CS improves prediction accuracy over alternative methods.
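To make the idea of inferring effect sizes from summary statistics and an LD reference panel concrete, the following is a minimal illustrative sketch, not the PRS-CS implementation. It assumes standardized genotypes and a Gaussian prior on each SNP effect whose variance is scaled by a fixed per-SNP value `psi`; under those assumptions the conditional posterior mean within one LD block is `(D + diag(1/psi))^{-1} beta_marginal`, where `D` is the LD matrix. In a continuous shrinkage approach the scales would themselves be sampled rather than fixed. The function names, the toy LD block, and all numeric values are hypothetical.

```python
import numpy as np


def posterior_mean_effects(beta_marginal, ld, psi):
    """Conditional posterior mean of SNP effects in one LD block.

    Assumes (for illustration only) standardized genotypes and a
    Gaussian prior on each effect with a fixed per-SNP scale psi_j.
    """
    beta_marginal = np.asarray(beta_marginal, dtype=float)
    ld = np.asarray(ld, dtype=float)
    psi = np.asarray(psi, dtype=float)
    # Prior precision goes on the diagonal: small psi_j means strong shrinkage.
    precision = ld + np.diag(1.0 / psi)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(precision, beta_marginal)


def polygenic_score(genotypes, weights):
    """Weighted sum of genotype values (rows: individuals, columns: SNPs)."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(weights, dtype=float)


# Hypothetical toy inputs: two SNPs in strong LD with similar marginal effects.
ld_block = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
beta_hat = np.array([0.10, 0.09])
psi = np.array([0.5, 0.5])

weights = posterior_mean_effects(beta_hat, ld_block, psi)
scores = polygenic_score([[0, 1], [2, 1]], weights)
print(weights, scores)
```

Because the two toy SNPs are correlated, the joint solve assigns each a smaller weight than shrinking its marginal estimate alone would, which is the sense in which modeling local LD avoids counting the same signal twice.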
