A statistical boosting framework for polygenic risk scores based on large-scale genotype data

Polygenic risk scores (PRS) quantify an individual's genetic liability to a given trait and are expected to play an increasingly important role in clinical risk stratification. Most often, PRS are estimated from summary statistics of univariate effects derived from genome-wide association studies. To improve the predictive performance of PRS, it is desirable to fit multivariable models directly on the genetic data. Because these data are large and high-dimensional, existing methods often cannot be applied directly, and new algorithms are required that keep both run time and memory demands manageable. We develop an adapted component-wise L2-boosting algorithm that fits continuous outcomes to genotype data from large cohort studies, using linear base-learners for the genetic variants. Similar to the snpnet approach, which implements lasso regression, the proposed snpboost approach works iteratively on smaller batches of variants. By restricting the set of possible base-learners in each boosting step to the variants most correlated with the residuals from previous iterations, computational efficiency can be increased substantially without losing prediction accuracy. Furthermore, for large-scale data on various traits from the UK Biobank, we show that our method yields competitive prediction accuracy and computational efficiency compared to the snpnet approach and other commonly used methods. Owing to the modular structure of boosting, our framework can be further extended to construct PRS for different outcome data and effect types; we illustrate this for the prediction of binary traits.
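The batch idea described above can be illustrated with a minimal NumPy sketch. It is not the snpboost implementation: the function name `batch_l2_boost`, its parameters (`batch_size`, `steps_per_batch`, the step length `nu`) and the simulated genotype matrix are all assumptions made for illustration, and the whole matrix is held in memory, which a real large-scale implementation would avoid. The sketch only shows the core loop: rank variants by their correlation with the current residuals, keep the top-ranked ones as the batch, and run a few component-wise L2-boosting updates restricted to that batch before refreshing it.

```python
import numpy as np


def batch_l2_boost(X, y, n_steps=200, batch_size=10, steps_per_batch=5, nu=0.1):
    """Component-wise L2-boosting with linear base-learners, restricted to
    batches of candidate columns (illustrative sketch, hypothetical API)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    col_means = X.mean(axis=0)
    Xc = X - col_means                      # centred variants
    col_ss = (Xc ** 2).sum(axis=0)          # sum of squares per variant
    col_ss[col_ss == 0] = np.inf            # constant variants are never selected

    intercept = y.mean()
    beta = np.zeros(X.shape[1])
    resid = y - intercept                   # residuals of the offset model

    for step in range(n_steps):
        if step % steps_per_batch == 0:
            # Refresh the batch: variants most correlated with the residuals.
            score = np.abs(Xc.T @ resid) / np.sqrt(col_ss)
            batch = np.argsort(score)[-batch_size:]

        # Fit one simple linear base-learner per variant in the batch.
        cov = Xc[:, batch].T @ resid
        slopes = cov / col_ss[batch]
        gain = cov * slopes                 # reduction in residual sum of squares
        k = int(np.argmax(gain))
        j = batch[k]

        # Update only the best-fitting variant, shrunk by the step length nu.
        beta[j] += nu * slopes[k]
        resid -= nu * slopes[k] * Xc[:, j]

    # Convert the intercept back to the scale of the uncentred variants.
    intercept -= col_means @ beta
    return intercept, beta


# Simulated example: 500 individuals, 200 variants coded 0/1/2, 3 causal variants.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(500, 200)).astype(float)
true_beta = np.zeros(200)
true_beta[[5, 50, 120]] = [1.0, -0.8, 0.5]
pheno = G @ true_beta + rng.normal(scale=0.5, size=500)

b0, b = batch_l2_boost(G, pheno)
prs = b0 + G @ b                            # the fitted score for each individual
print("selected variants:", np.flatnonzero(b).size)
```

Because only one coefficient changes per step, and stopping early leaves most coefficients at exactly zero, the resulting score is sparse. In practice the number of boosting steps would be chosen on validation data rather than fixed in advance as it is here.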
