Is an ROC-type response truly always better than a binary response in observer performance studies?

RATIONALE AND OBJECTIVES: The aim of this study was to assess similarities and differences between methods of performance comparison under binary (yes/no) and receiver operating characteristic (ROC)-type pseudocontinuous (0-100) rating data collected during an observer performance study of the interpretation of full-field digital mammography (FFDM) alone versus FFDM plus digital breast tomosynthesis (DBT).

MATERIALS AND METHODS: Rating data consisted of ROC-type pseudocontinuous and binary ratings generated by eight radiologists evaluating 77 digital mammographic examinations. Overall performance was summarized with the conventional probability of correct discrimination or, equivalently, the area under the ROC curve (AUC), which under a binary scale reduces to a simple function of Youden's index (see the derivation below). The magnitudes of the differences in reader-averaged empirical AUCs between FFDM alone and FFDM plus DBT were compared in the context of both fixed-reader and random-reader variability of the estimates.

RESULTS: The absolute differences between the modes in empirical AUC were larger on average for the binary scale (0.12 vs 0.07) and for the majority of individual readers (six of eight). Standardized differences were consistent with this finding (2.32 vs 1.63 on average). Reader-averaged differences in AUC standardized by the fixed-reader and random-reader variances were also larger under the binary rating paradigm. The discrepancy between the AUC differences under the two paradigms depended on the location of the reader-specific binary operating points.

CONCLUSIONS: The human observer's operating point should be a primary consideration in the design of an observer performance study. Although the ROC-type rating paradigm generally provides more detailed information about the characteristics of the modes being compared, it does not reflect the actual operating point adopted by human observers. There are application-driven scenarios in which an analysis based on binary responses may provide statistical advantages.
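To make the Youden's index connection concrete: a binary response yields an empirical ROC curve with a single interior operating point, so the trapezoidal AUC is determined entirely by sensitivity (Se) and specificity (Sp). A minimal derivation using standard ROC algebra (not specific to this study's data), with f = 1 - Sp as the false-positive fraction and t = Se as the true-positive fraction:

    % The empirical ROC curve is piecewise linear through (0,0), (f,t), (1,1).
    \[
    \widehat{\mathrm{AUC}}_{\mathrm{bin}}
      = \tfrac{1}{2} f t + \tfrac{1}{2}(1 - f)(1 + t)
      = \frac{1 + t - f}{2}
      = \frac{\mathrm{Se} + \mathrm{Sp}}{2}
      = \frac{J + 1}{2},
    \qquad J = \mathrm{Se} + \mathrm{Sp} - 1,
    \]

where J is Youden's index. Comparing binary AUCs across modes is therefore equivalent to comparing Youden's indices at the readers' adopted operating points.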
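For the pseudocontinuous (0-100) scale, the empirical AUC corresponds to the Mann-Whitney probability of correct discrimination. The Python sketch below shows the two estimators side by side; the ratings, case counts, and the cutoff of 50 are synthetic illustrations, not the study's data:

    import numpy as np

    def empirical_auc(ratings_pos, ratings_neg):
        """Mann-Whitney estimate: P(rating_pos > rating_neg) + 0.5 * P(tie)."""
        pos = np.asarray(ratings_pos, dtype=float)[:, None]
        neg = np.asarray(ratings_neg, dtype=float)[None, :]
        return float(((pos > neg) + 0.5 * (pos == neg)).mean())

    def binary_auc(calls_pos, calls_neg):
        """AUC implied by a single yes/no operating point: (Se + Sp) / 2."""
        se = np.mean(calls_pos)        # sensitivity: "yes" rate on diseased cases
        sp = 1.0 - np.mean(calls_neg)  # specificity: "no" rate on normal cases
        return 0.5 * (se + sp)

    rng = np.random.default_rng(0)
    pos = rng.normal(65, 15, size=30).clip(0, 100)  # synthetic diseased-case ratings
    neg = rng.normal(45, 15, size=47).clip(0, 100)  # synthetic normal-case ratings

    print(f"pseudocontinuous AUC: {empirical_auc(pos, neg):.3f}")
    print(f"binary AUC (cutoff 50): {binary_auc(pos >= 50, neg >= 50):.3f}")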
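The fixed-reader versus random-reader distinction determines which variance standardizes the AUC difference. Below is a deliberately simplified random-reader standardization (a paired t-type statistic across readers), offered only as a stand-in for the full multireader-multicase models used in analyses of this kind; the per-reader AUC values are hypothetical:

    import numpy as np

    # Hypothetical per-reader empirical AUCs for eight readers (illustrative only).
    auc_ffdm  = np.array([0.74, 0.78, 0.71, 0.80, 0.76, 0.73, 0.77, 0.75])
    auc_combo = np.array([0.82, 0.84, 0.79, 0.86, 0.83, 0.80, 0.85, 0.81])

    diff = auc_combo - auc_ffdm

    # Random-reader standardization: readers are treated as a sample from a reader
    # population, so the between-reader spread of the differences enters the
    # standard error.
    se_random = diff.std(ddof=1) / np.sqrt(diff.size)
    print(f"mean AUC difference: {diff.mean():.3f}")
    print(f"random-reader standardized difference: {diff.mean() / se_random:.2f}")

    # A fixed-reader analysis instead conditions on these eight readers and
    # standardizes by case-sampling variance alone (e.g., jackknife or bootstrap
    # over the cases), which omits the between-reader component and is therefore
    # typically smaller.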
