Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction

The accuracy of polygenic risk scores (PRSs) in predicting complex diseases increases with the training sample size. PRSs are generally derived from summary statistics of large meta-analyses of multiple genome-wide association studies (GWAS). However, it is now common for researchers to also have access to large individual-level data sets, such as the UK Biobank. To the best of our knowledge, no study has yet explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combining data is the meta-analysis of GWAS summary statistics (Meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using twelve real case-control and quantitative traits from both iPSYCH and UK Biobank, along with external GWAS summary statistics, we compare Meta-GWAS with two alternative data-combining approaches: stacked clumping and thresholding (SCT) and Meta-PRS. We find that, when a large individual-level data set is available, the linear combination of PRSs (Meta-PRS) is both a simple alternative to Meta-GWAS and often more accurate.
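To make the contrast between the two combining strategies concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that Meta-GWAS is a standard fixed-effect inverse-variance-weighted meta-analysis of per-variant effects, and that the Meta-PRS weights are fitted by ordinary least squares on individual-level data; the abstract specifies neither detail. All function names (`meta_gwas`, `fit_meta_prs`, `predict_meta_prs`, `r2`) and the simulated data are hypothetical.

```python
import numpy as np


def meta_gwas(beta_a, se_a, beta_b, se_b):
    """Fixed-effect inverse-variance-weighted meta-analysis (assumed scheme)."""
    w_a, w_b = 1.0 / se_a**2, 1.0 / se_b**2
    beta = (w_a * beta_a + w_b * beta_b) / (w_a + w_b)
    se = np.sqrt(1.0 / (w_a + w_b))
    return beta, se


def fit_meta_prs(prs_matrix, y):
    """Least-squares weights (intercept first) for a linear combination of PRSs."""
    X = np.column_stack([np.ones(len(y)), prs_matrix])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_meta_prs(coef, prs_matrix):
    """Apply fitted weights to a matrix with one PRS per column."""
    return coef[0] + prs_matrix @ coef[1:]


def r2(y, pred):
    """Squared Pearson correlation between phenotype and prediction."""
    return np.corrcoef(y, pred)[0, 1] ** 2


# Toy data: two noisy PRSs that track the same simulated genetic value.
rng = np.random.default_rng(0)
n = 2000
genetic_value = rng.normal(size=n)
y = genetic_value + rng.normal(size=n)
prs_external = genetic_value + rng.normal(scale=1.5, size=n)  # e.g. from external summary statistics
prs_internal = genetic_value + rng.normal(scale=1.0, size=n)  # e.g. from individual-level data
prs = np.column_stack([prs_external, prs_internal])

coef = fit_meta_prs(prs, y)
combined = predict_meta_prs(coef, prs)
r2_external = r2(y, prs_external)
r2_internal = r2(y, prs_internal)
r2_combined = r2(y, combined)
```

In this sketch the weights are fitted and evaluated on the same individuals, so the combined score cannot do worse than either input by construction. A fair comparison would fit the weights in one subset and evaluate in a held-out one.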
