Multi-Reader Multi-Case Studies Using the Area under the Receiver Operator Characteristic Curve as a Measure of Diagnostic Accuracy: Systematic Review with a Focus on Quality of Data Reporting

Introduction: We examined the design, analysis and reporting in multi-reader multi-case (MRMC) research studies using the area under the receiver operating characteristic curve (ROC AUC) as a measure of diagnostic performance.

Methods: We performed a systematic literature review from 2005 to 2013 inclusive to identify a minimum of 50 studies. Articles of diagnostic test accuracy in humans were identified via their citation of key methodological articles dealing with MRMC ROC AUC. Two researchers in consensus then extracted information from primary articles relating to study characteristics and design, methods for reporting study outcomes, model fitting, model assumptions, presentation of results, and interpretation of findings. Results were summarized and presented with a descriptive analysis.

Results: Sixty-four full papers were retrieved from 475 identified citations, and ultimately 49 articles describing 51 studies were reviewed and extracted. Radiological imaging was the index test in all. Most studies focused on lesion detection rather than characterization and used fewer than 10 readers. Only 6 (12%) studies trained readers in advance to use the confidence scale used to build the ROC curve. Overall, description of confidence scores, the ROC curve and its analysis was often incomplete. For example, 21 (41%) studies presented no ROC curve and only 3 (6%) described the distribution of confidence scores. Of 30 studies presenting curves, only 4 (13%) presented the data points underlying the curve, thereby allowing assessment of extrapolation. The mean change in AUC was 0.05 (−0.05 to 0.28). A non-significant change in AUC was attributed to underpowering rather than to the diagnostic test failing to improve diagnostic accuracy.

Conclusions: Data reporting in MRMC studies using ROC AUC as an outcome measure is frequently incomplete, hampering understanding of methods and the reliability of results and study conclusions. Authors using this analysis should be encouraged to provide a full description of their methods and results.
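To illustrate what "the data points underlying the curve" means, the following minimal Python sketch derives the empirical operating points and a trapezoidal AUC from ordinal confidence scores for a single reader. The function names and the example scores are hypothetical and are not taken from any reviewed study. The sketch does not implement an MRMC method such as Dorfman-Berbaum-Metz or Obuchowski-Rockette, and it does not fit a binormal curve.

```python
def empirical_roc(scores_pos, scores_neg):
    """Return empirical ROC operating points as (FPF, TPF) pairs.

    scores_pos: confidence scores for cases that are truly positive.
    scores_neg: confidence scores for cases that are truly negative.
    A case is called positive at threshold t when its score is >= t.
    """
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]  # strictest possible threshold: nothing called positive
    for t in thresholds:
        tpf = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpf = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpf, tpf))
    return points  # the loosest threshold yields (1.0, 1.0)


def trapezoid_auc(points):
    """Area under the piecewise-linear curve joining the operating points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical 5-point confidence scores for one reader (illustrative only).
    positives = [5, 4, 4, 3, 2]
    negatives = [1, 1, 2, 3, 2]
    pts = empirical_roc(positives, negatives)
    print(pts)
    print(round(trapezoid_auc(pts), 3))
```

Reporting the operating points alongside the curve lets a reader see how much of the curve is supported by observed data and how much is extrapolated between or beyond those points.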
