Exploration of Analysis Methods for Diagnostic Imaging Tests: Problems with ROC AUC and Confidence Scores in CT Colonography

Background: Different methods of evaluating diagnostic performance can lead to different results when diagnostic tests are compared. We compared two such approaches, sensitivity and specificity versus area under the receiver operating characteristic curve (ROC AUC), for the evaluation of CT colonography for the detection of polyps, with or without computer-assisted detection.

Methods: In a multireader multicase study of 10 readers and 107 cases, we compared sensitivity and specificity, based on radiological reporting of the presence or absence of polyps, with ROC AUC calculated from confidence scores concerning the presence of polyps. Both methods were assessed against a reference standard. Here we focus on five readers, selected to illustrate issues in design and analysis. We compared diagnostic measures within readers, showing that differences in results are due to statistical methods.

Results: Reader performance varied widely depending on whether sensitivity and specificity or ROC AUC was used. There were several problems with using confidence scores: assigning scores to all cases; the use of zero scores when no polyps were identified; the bimodal, non-normal distribution of scores; fitting ROC curves, because of extrapolation beyond the study data; and the undue influence of a few false positive results. Variation due to the use of different ROC methods exceeded differences between test results for ROC AUC.

Conclusions: The confidence scores recorded in our study violated many assumptions of ROC AUC methods, rendering these methods inappropriate. The problems we identified will apply to other detection studies using confidence scores. We found that sensitivity and specificity were a more reliable and clinically appropriate method for comparing diagnostic tests.
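To make the contrast between the two kinds of measure concrete, the following is a minimal sketch in Python. It is not the analysis used in the study: the ten cases, the reader's calls, and the confidence scores are invented for illustration, and the function names `sensitivity_specificity` and `empirical_auc` are hypothetical. The AUC here is the simple empirical (Mann-Whitney) estimate with ties counted as one half, which is only one of several ROC methods. The sketch assumes a score of zero is recorded whenever the reader reports no polyp, and it changes the score of a single false positive case to show how that one value can move the AUC while sensitivity and specificity stay the same.

```python
def sensitivity_specificity(calls, truth):
    """Sensitivity and specificity from binary reader calls (1 = polyp reported)."""
    tp = sum(1 for c, t in zip(calls, truth) if c == 1 and t == 1)
    fn = sum(1 for c, t in zip(calls, truth) if c == 0 and t == 1)
    tn = sum(1 for c, t in zip(calls, truth) if c == 0 and t == 0)
    fp = sum(1 for c, t in zip(calls, truth) if c == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)


def empirical_auc(scores, truth):
    """Empirical (Mann-Whitney) AUC: the fraction of positive/negative case
    pairs in which the positive case has the higher score, ties counted as 0.5."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


# Invented data for one hypothetical reader: 4 cases with polyps, 6 without.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
calls = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one missed polyp, one false positive

# Confidence scores (0-100); zero is recorded wherever no polyp was reported.
scores_low_fp = [90, 80, 70, 0, 30, 0, 0, 0, 0, 0]   # false positive scored 30
scores_high_fp = [90, 80, 70, 0, 95, 0, 0, 0, 0, 0]  # same case scored 95

sens, spec = sensitivity_specificity(calls, truth)
auc_low_fp = empirical_auc(scores_low_fp, truth)
auc_high_fp = empirical_auc(scores_high_fp, truth)

print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
print(f"AUC with low-confidence false positive:  {auc_low_fp:.3f}")
print(f"AUC with high-confidence false positive: {auc_high_fp:.3f}")
```

In this toy setting the binary calls are identical in both scenarios, so sensitivity and specificity do not change, whereas the AUC drops when the single false positive is given a high confidence score. The many zero scores also create ties between positive and negative cases, which the empirical estimate has to resolve by convention.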
