ROC Confidence Bands: An Empirical Study

This paper is about constructing confidence bands around an ROC curve such that (1 - \delta)% of the ROC curves traced by data sets of size r will fall completely within the bands. We introduce to the machine learning community three methods from the medical field that can be used to generate such bands. We then evaluate these methods on the simple case of "binormal" distributions, in which the scores for positive instances and the scores for negative instances are drawn from two normal distributions. We show that none of the methods generates appropriate bands, and we investigate two types of variance problems. We show that widening the bands does not produce the proper bandwidths, but that fitting a normal distribution to the observed samples and drawing new samples from this fitted distribution (parametric bootstrap) does generate bands that are much closer to the desired coverage, although still not perfect. We tested the original methods as well as the parametric bootstrap on the covertype data set from the UCI ML repository. The original methods perform the same as in the synthetic case, whereas the parametric bootstrap technique did not yield the expected results. This is primarily due to not being able to generate a good fit for the score distributions. Whether it is possible to fit well-behaved parametric distributions to the scores of learned models is an open question we leave to the machine learning community to answer.
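The parametric bootstrap described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `roc_tpr_at` and `parametric_bootstrap_band`, the FPR grid, the number of resamples, and the use of pointwise percentile limits are all assumptions made for illustration. Pointwise limits are generally narrower than a band meant to contain whole curves with probability (1 - \delta), so the sketch shows only the resampling step, assuming binormal scores.

```python
import numpy as np

def roc_tpr_at(pos, neg, fpr_grid):
    """Empirical true positive rate at each false positive rate in fpr_grid."""
    # Score threshold above which roughly a fraction `fpr` of negatives lie.
    thresholds = np.quantile(neg, 1.0 - fpr_grid)
    # TPR is the fraction of positive scores above each threshold.
    return (pos[None, :] > thresholds[:, None]).mean(axis=1)

def parametric_bootstrap_band(pos, neg, fpr_grid, n_boot=500, delta=0.1, seed=0):
    """Pointwise (illustrative) band from a parametric bootstrap.

    Assumption: positive and negative scores are each modeled by a
    normal distribution fitted to the observed sample.
    """
    rng = np.random.default_rng(seed)
    mu_p, sd_p = pos.mean(), pos.std(ddof=1)
    mu_n, sd_n = neg.mean(), neg.std(ddof=1)

    curves = np.empty((n_boot, fpr_grid.size))
    for b in range(n_boot):
        # Draw new samples of the original sizes from the fitted normals.
        boot_pos = rng.normal(mu_p, sd_p, size=pos.size)
        boot_neg = rng.normal(mu_n, sd_n, size=neg.size)
        curves[b] = roc_tpr_at(boot_pos, boot_neg, fpr_grid)

    # Percentile limits at each FPR value (pointwise, not simultaneous).
    lower = np.quantile(curves, delta / 2.0, axis=0)
    upper = np.quantile(curves, 1.0 - delta / 2.0, axis=0)
    return lower, upper

if __name__ == "__main__":
    # Synthetic binormal scores; the means and sample sizes are arbitrary.
    rng = np.random.default_rng(42)
    pos_scores = rng.normal(1.0, 1.0, size=300)
    neg_scores = rng.normal(0.0, 1.0, size=300)
    grid = np.linspace(0.05, 0.95, 19)
    lo, hi = parametric_bootstrap_band(pos_scores, neg_scores, grid)
    for f, l, u in zip(grid, lo, hi):
        print(f"FPR={f:.2f}  TPR band=[{l:.3f}, {u:.3f}]")
```

To estimate coverage in the sense used in the abstract, one would draw fresh data sets from the true distributions and count the fraction of resulting ROC curves that lie entirely between `lower` and `upper` at every grid point.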
