Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar

Abstract

Objective To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions.

Design Longitudinal cohort study from 1 January 1998 to 31 December 2018.

Setting and participants 3.6 million patients from the Clinical Practice Research Datalink registered at 391 general practices in England with linked hospital admission and mortality records.

Main outcome measures Model performance including discrimination, calibration, and consistency of individual risk prediction for the same patients among models with comparable model performance. 19 different prediction techniques were applied, including 12 families of machine learning models (grid searched for best models), three Cox proportional hazards models (local fitted, QRISK3, and Framingham), three parametric survival models, and one logistic model.

Results The various models had similar population level performance (C statistics of about 0.87 and similar calibration). However, the predictions for individual risks of cardiovascular disease varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network. The differences in predicted risks between QRISK3 and a neural network ranged between –23.2% and 0.1% (95% range). Models that ignored censoring (that is, assumed censored patients to be event free) substantially underestimated risk of cardiovascular disease. Of the 223 815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model.

Conclusions A variety of models predicted risks for the same patients very differently despite similar model performances. The logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring. Survival models that consider censoring and that are explainable, such as QRISK3, are preferable. The level of consistency within and between models should be routinely assessed before they are used for clinical decision making.
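The censoring result can be illustrated with a small simulation. The sketch below is not the study's analysis: the cohort is synthetic, the true 10 year risk of 10%, the exponential event times, and the uniform follow-up window are all assumptions chosen for illustration, and `kaplan_meier_risk` is a hypothetical helper written for this example. It compares a naive estimate that treats censored patients as event free with a Kaplan-Meier estimate that accounts for censoring.

```python
import math
import random


def kaplan_meier_risk(times, events, horizon):
    """Return 1 - S(horizon), where S is the Kaplan-Meier survival estimate."""
    # Sort by follow-up time; at tied times, process events before censorings.
    pairs = sorted(zip(times, events), key=lambda p: (p[0], not p[1]))
    survival = 1.0
    at_risk = len(pairs)
    for t, had_event in pairs:
        if t > horizon:
            break
        if had_event:
            survival *= 1.0 - 1.0 / at_risk
        at_risk -= 1  # the patient leaves the risk set after an event or censoring
    return 1.0 - survival


# Synthetic cohort (all values below are illustrative assumptions).
rng = random.Random(0)
n = 20000
horizon = 10.0    # years
true_risk = 0.10  # assumed true 10 year risk
rate = -math.log(1.0 - true_risk) / horizon

follow_up, observed = [], []
for _ in range(n):
    event_time = rng.expovariate(rate)
    censor_time = rng.uniform(0.0, 20.0)  # many patients leave before 10 years
    follow_up.append(min(event_time, censor_time, horizon))
    observed.append(event_time <= censor_time and event_time <= horizon)

# Naive estimate: censored patients are counted as event free.
naive_risk = sum(observed) / n
# Survival estimate: censored patients contribute only while under follow-up.
km_risk = kaplan_meier_risk(follow_up, observed, horizon)

print(f"assumed true risk:   {true_risk:.3f}")
print(f"naive estimate:      {naive_risk:.3f}")
print(f"Kaplan-Meier:        {km_risk:.3f}")
```

Under these assumptions the naive proportion falls below the assumed true risk, because patients who leave follow-up early are counted as never having an event, while the Kaplan-Meier estimate stays close to it. The size of the gap depends entirely on the assumed censoring pattern and should not be read as an estimate of the bias reported in the study.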
