A spline-based tool to assess and visualize the calibration of multiclass risk predictions

When validating risk models (or probabilistic classifiers), calibration is often overlooked. Calibration refers to the reliability of the predicted risks, i.e. whether the predicted risks correspond to observed probabilities. In medical applications this is important because treatment decisions often rely on the estimated risk of disease. The aim of this paper is to present generic tools to assess the calibration of multiclass risk models. We describe a calibration framework based on a vector spline multinomial logistic regression model. This framework can be used to generate calibration plots and to calculate the estimated calibration index (ECI), which quantifies lack of calibration. We illustrate these tools using risk models that characterize ovarian tumors. The outcome is the final histological diagnosis, combined with the surgical stage of the tumor where relevant, and is divided into five classes: benign, borderline malignant, stage I, stage II-IV, and secondary metastatic cancer. The 5909 patients included in the study were randomly split into equally sized training and test sets. We developed and tested models using the following algorithms: logistic regression, support vector machines, k nearest neighbors, random forest, naive Bayes and nearest shrunken centroids. Multiclass calibration plots offer a direct way to visualize the reliability of predicted risks. The ECI is a convenient tool for comparing models, but is less informative and less interpretable than calibration plots. In our case study, logistic regression and random forest were the best calibrated, and naive Bayes was the worst calibrated.
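As a rough illustration of the idea behind a summary index of miscalibration, the sketch below averages the squared differences between the predicted risks and the observed probabilities estimated by a calibration model. It is a minimal sketch under stated assumptions, not the paper's definition of the ECI: the abstract gives neither the formula nor its scaling, the function name `mean_squared_calibration_gap` is hypothetical, and the toy numbers are invented for illustration only.

```python
def mean_squared_calibration_gap(predicted, observed):
    """Average squared gap between predicted risks and estimated observed
    probabilities, taken over all patients and all outcome classes.

    Assumption: `observed` holds the probabilities estimated by some
    calibration model (for example, a flexible multinomial recalibration
    fit); how they are obtained is outside the scope of this sketch.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed need the same number of rows")
    total = 0.0
    count = 0
    for p_row, o_row in zip(predicted, observed):
        if len(p_row) != len(o_row):
            raise ValueError("each row needs one value per outcome class")
        for p, o in zip(p_row, o_row):
            total += (p - o) ** 2
            count += 1
    if count == 0:
        raise ValueError("no probabilities supplied")
    return total / count


# Toy example: two patients, three outcome classes (illustrative values only).
predicted = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
observed = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]]
gap = mean_squared_calibration_gap(predicted, observed)
perfect = mean_squared_calibration_gap(predicted, predicted)
```

A value of zero means the predicted risks coincide with the estimated observed probabilities; larger values indicate poorer calibration. A single number of this kind is convenient for ranking models, but it does not show where on the risk scale, or for which class, the miscalibration occurs.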
