Calibration of medical diagnostic classifier scores to the probability of disease

Statistical classifiers in many clinical decision support systems and other medical diagnostic devices produce scores on an arbitrary scale, so the clinical meaning of those scores is unclear. Calibrating the scores to a meaningful scale, such as the probability of disease, is potentially useful when a physician uses them. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. Because the calibration function is trained on a finite dataset, it is important to accompany each probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to find the calibration transformation is also used to estimate calibration performance, a resubstitution bias exists for performance metrics that involve the truth states. However, this bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
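The following sketch illustrates two of the statements above on simulated data. It is not the authors' implementation. The abstract does not specify the three methods, so the sketch assumes a logistic (Platt-style) map as a stand-in for a parametric calibration, Gaussian score distributions, and the Brier score as an example of a metric that involves the truth states. All function names and parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc(values, labels):
    """Mann-Whitney estimate of the area under the ROC curve (ties count 1/2)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, labels):
    """Mean squared difference between predicted probability and truth state."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def fit_logistic(scores, labels, steps=2000, lr=0.5):
    """Fit P(disease | s) = sigmoid(a*s + b) by gradient ascent on the
    mean log-likelihood (assumed stand-in for a parametric calibration)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            residual = y - sigmoid(a * s + b)
            grad_a += residual * s
            grad_b += residual
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def simulate(n_per_class, rng):
    """Assumed score model: non-diseased ~ N(0, 1), diseased ~ N(1.5, 1)."""
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    scores += [rng.gauss(1.5, 1.0) for _ in range(n_per_class)]
    labels = [0] * n_per_class + [1] * n_per_class
    return scores, labels


rng = random.Random(0)
train_s, train_y = simulate(100, rng)  # used to fit the calibration map
test_s, test_y = simulate(100, rng)    # independent data for evaluation

a, b = fit_logistic(train_s, train_y)
train_p = [sigmoid(a * s + b) for s in train_s]
test_p = [sigmoid(a * s + b) for s in test_s]

# Discrimination: compare AUC of the raw scores and the calibrated probabilities.
auc_raw = auc(train_s, train_y)
auc_cal = auc(train_p, train_y)
print("AUC raw vs calibrated:", auc_raw, auc_cal)

# Calibration metric on the training data (resubstitution) and on independent data.
print("Brier, resubstitution:", brier(train_p, train_y))
print("Brier, independent:   ", brier(test_p, test_y))
```

Because the fitted logistic map is strictly increasing when `a > 0`, it preserves the rank order of the scores, so the two AUC values coincide. The resubstitution Brier score tends to be optimistic on average, but a single simulated draw like this one need not show it; averaging over many repetitions would be needed to estimate the bias.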
