Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status

Coronary artery disease (CAD) is the leading global cause of mortality and has substantial heritability with a polygenic architecture. Recent approaches to risk prediction were based on polygenic risk scores (PRS), which do not take possible nonlinear effects into account and are restricted to genetic loci already associated with CAD. We benchmarked PRS, (penalized) logistic regression, naïve Bayes (NB), random forests (RF), support vector machines (SVM), and gradient boosting (GB) on a data set of 7,736 CAD cases and 6,774 controls from Germany to identify the algorithms that classify CAD status most accurately. The final models were tested on an independent data set from Germany (527 CAD cases and 473 controls). We found PRS to be the best algorithm, yielding an area under the receiver operating characteristic curve (AUC) of 0.92 (95% CI [0.90, 0.95], 50,633 loci) in the German test data. NB and SVM (AUC ~ 0.81) performed better than RF and GB (AUC ~ 0.75). We conclude that PRS is superior to the machine learning methods evaluated here for predicting CAD status.
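
As a rough illustration of the two quantities the abstract relies on, the sketch below computes a PRS as a weighted sum of risk-allele dosages and estimates the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the three-locus weights, and the toy genotypes are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per locus)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(case_scores, control_scores):
    """Fraction of case/control pairs in which the case scores higher.

    Ties count as half a win, which matches the Mann-Whitney
    formulation of the area under the ROC curve.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-locus effect weights (e.g. log odds ratios); not from the paper.
weights = [0.10, 0.05, 0.20]

# Hypothetical risk-allele dosages for three cases and three controls.
cases = [[2, 1, 2], [1, 2, 1], [2, 0, 1]]
controls = [[0, 1, 0], [1, 0, 1], [0, 2, 0]]

case_scores = [polygenic_risk_score(g, weights) for g in cases]
control_scores = [polygenic_risk_score(g, weights) for g in controls]
toy_auc = auc(case_scores, control_scores)

print("case scores:   ", case_scores)
print("control scores:", control_scores)
print("toy AUC:       ", toy_auc)
```

The pairwise loop is quadratic in sample size, which is fine for a toy example; on data sets of the size reported above, a rank-based computation would be the more practical choice.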
