Mind the Performance Gap: Examining Dataset Shift During Prospective Validation

Once integrated into clinical care, patient risk stratification models may perform worse than they did in retrospective validation. It is widely accepted that performance degrades over time due to changes in care processes and patient populations. However, the extent to which this occurs is poorly understood, in part because few researchers report prospective validation performance. In this study, we compare the 2020-2021 (’20-’21) prospective performance of a patient risk stratification model for predicting healthcare-associated infections to a 2019-2020 (’19-’20) retrospective validation of the same model. We define the difference between retrospective and prospective performance as the performance gap. We estimate how i) “temporal shift”, i.e., changes in clinical workflows and patient populations, and ii) “infrastructure shift”, i.e., changes in access, extraction and transformation of data, each contribute to the performance gap. Applied prospectively to 26,864 hospital encounters during a twelve-month period from July 2020 to June 2021, the model achieved an area under the receiver operating characteristic curve (AUROC) of 0.767 (95% confidence interval (CI): 0.737, 0.801) and a Brier score of 0.189 (95% CI: 0.186, 0.191). Prospective performance decreased slightly compared to ’19-’20 retrospective performance, in which the model achieved an AUROC of 0.778 (95% CI: 0.744, 0.815) and a Brier score of 0.163 (95% CI: 0.161, 0.165). The resulting performance gap was primarily due to infrastructure shift and not temporal shift. So long as we continue to develop and validate models using data stored in large research data warehouses, we must consider differences in how and when data are accessed, measure how these differences may negatively affect prospective performance, and work to mitigate those differences. © 2021 E. Ötleş & J. Oh et al.
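The quantities reported above (AUROC, Brier score, their confidence intervals, and the performance gap) can be illustrated with a minimal sketch. It is not the study's evaluation code: the function names are hypothetical, the percentile bootstrap is an assumed way of obtaining a 95% CI (the abstract does not state how the intervals were computed), and the sign convention for the gap (retrospective minus prospective) is likewise an assumption.

```python
import random

def auroc(y_true, y_score):
    """AUROC as the probability that a random positive case is scored
    above a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def bootstrap_ci(metric, y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed method, not taken from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples with a single class; AUROC is undefined
        stats.append(metric(ys, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def performance_gap(retrospective, prospective):
    """Assumed convention: retrospective minus prospective performance."""
    return retrospective - prospective

# Gaps computed from the point estimates reported in the abstract.
auroc_gap = performance_gap(0.778, 0.767)  # positive: AUROC fell prospectively
brier_gap = performance_gap(0.163, 0.189)  # negative: Brier score rose (worse)
```

Under this convention a positive AUROC gap and a negative Brier gap both indicate worse prospective performance, since higher AUROC is better and lower Brier score is better.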
