A tutorial on calibration measurements and calibration models for clinical prediction models

Abstract: Our primary objective is to provide the clinical informatics community with an introductory tutorial on calibration measurements and calibration models for predictive models, using existing R packages and custom R code on real and simulated data. The performance of clinical predictive models is commonly reported in terms of discrimination measures, but using a model for individualized predictions also requires adequate calibration. This tutorial is intended for clinical researchers who want to evaluate how well a predictive model applies to a particular population. It is also intended for informaticians and software engineers who want to understand the role that calibration plays in evaluating a clinical predictive model, and it gives them a solid starting point for incorporating calibration evaluation and calibration models into their work. Topics covered include (1) an introduction to the importance of calibration in the clinical setting, (2) an illustration of the distinct roles that discrimination and calibration play in the assessment of clinical predictive models, (3) a tutorial and demonstration of selected calibration measurements, (4) a tutorial and demonstration of selected calibration models, and (5) a brief discussion of the limitations of these methods, with practical suggestions on how to use them.
