Pointwise ROC Confidence Bounds: An Empirical Evaluation

This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine learning research. We evaluate whether the bounds indeed contain the relevant operating point on the “true” ROC curve with a confidence of 1−δ. We then evaluate whether pointwise confidence bounds can delimit the region where a model's future performance is expected to lie. For evaluation we use a synthetic world based on “binormal” distributions: the classification scores for positive and negative instances are drawn from (separate) normal distributions. For the “true-curve” bounds, all methods are sensitive to how well the distributions are separated, which corresponds directly to the area under the ROC curve. One method produces bounds that are universally too loose, another universally too tight, and the remaining two are close to the desired containment, although containment breaks down at the extremes of the ROC curve. As would be expected, all methods fail when used to contain “future” ROC curves. Widening the bounds to account for the increased uncertainty yields qualitatively identical results to the “true-curve” evaluation. We conclude by recommending a simple, very efficient method (vertical averaging) for large sample sizes and a more computationally expensive method (kernel estimation) for small sample sizes.
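
The recommended vertical-averaging method fixes a grid of false-positive rates and looks at the distribution of true-positive rates at each grid point. As a rough illustration only (not the paper's exact procedure), the sketch below pairs vertical averaging with bootstrap percentile intervals in the binormal synthetic world; all function names, parameter values, and the choice of a bootstrap-percentile interval are illustrative assumptions.

```python
import numpy as np

def binormal_scores(n_pos, n_neg, separation, rng):
    """Draw classifier scores from the binormal model:
    negatives ~ N(0, 1), positives ~ N(separation, 1)."""
    pos = rng.normal(separation, 1.0, n_pos)
    neg = rng.normal(0.0, 1.0, n_neg)
    return pos, neg

def tpr_at_fpr(pos, neg, fpr_grid):
    """Empirical TPR at each fixed FPR: threshold at the
    (1 - fpr)-quantile of the negative scores, then count the
    fraction of positives at or above that threshold."""
    thresholds = np.quantile(neg, 1.0 - fpr_grid)
    return (pos[None, :] >= thresholds[:, None]).mean(axis=1)

def vertical_average_bounds(pos, neg, fpr_grid, delta=0.05,
                            n_boot=1000, rng=None):
    """Pointwise 1 - delta bounds by vertical averaging:
    resample scores with replacement, recompute TPR at each
    fixed FPR, and take percentile intervals column-wise."""
    if rng is None:
        rng = np.random.default_rng()
    tprs = np.empty((n_boot, len(fpr_grid)))
    for b in range(n_boot):
        p = rng.choice(pos, size=len(pos), replace=True)
        n = rng.choice(neg, size=len(neg), replace=True)
        tprs[b] = tpr_at_fpr(p, n, fpr_grid)
    lower = np.quantile(tprs, delta / 2.0, axis=0)
    upper = np.quantile(tprs, 1.0 - delta / 2.0, axis=0)
    return lower, upper

rng = np.random.default_rng(0)
pos, neg = binormal_scores(500, 500, separation=1.0, rng=rng)
fpr_grid = np.linspace(0.05, 0.95, 19)
lower, upper = vertical_average_bounds(pos, neg, fpr_grid, rng=rng)
```

In this synthetic world the true AUC is Φ(separation/√2), so the separation parameter directly controls how well-separated the score distributions are, which is the quantity the containment results above are sensitive to.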
