The "laboratory" effect: comparing radiologists' performance and variability during prospective clinical and laboratory mammography interpretations.

PURPOSE To compare radiologists' performance when interpreting screening mammograms in the clinic with their performance when reading the same mammograms in a retrospective laboratory study.

MATERIALS AND METHODS This study was conducted under an institutional review board-approved, HIPAA-compliant protocol; the need for informed consent was waived. Using a screening Breast Imaging Reporting and Data System (BI-RADS) rating scale, nine experienced radiologists rated an enriched set of mammograms that they had personally read in the clinic (the "reader-specific" set), mixed with an enriched "common" set of mammograms that none of the participants had previously read in the clinic. For both the reader-specific and common sets, the original clinical recommendations to recall the women for a diagnostic work-up were compared with the radiologists' recommendations during the retrospective experiment. Results are presented as reader-specific and group-averaged sensitivity and specificity levels and as the dispersion (spread) of the reader-specific performance estimates.

RESULTS On average, the radiologists' performance was significantly better in the clinic than in the laboratory (P = .035). Interreader dispersion of the computed performance levels was significantly lower during the clinical interpretations (P < .01).

CONCLUSION Retrospective laboratory experiments may not adequately represent either the expected performance levels or the interreader variability observed when the same set of mammograms is interpreted in the clinical environment.
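To make the reported measures concrete, the sketch below illustrates one way reader-specific and group-averaged sensitivity and specificity, and the interreader dispersion of those estimates, could be computed from binary recall decisions. It is a minimal illustration under assumed inputs: the function names, the data layout, and the use of scipy.stats.levene as a generic test of equality of spread are all assumptions for illustration, not the study's actual statistical analysis.

import numpy as np
from scipy import stats


def sensitivity_specificity(recalled, has_cancer):
    # Sensitivity and specificity of one reader's binary recall decisions.
    recalled = np.asarray(recalled, dtype=bool)
    has_cancer = np.asarray(has_cancer, dtype=bool)
    sensitivity = recalled[has_cancer].mean()        # recalled fraction of cancer cases
    specificity = (~recalled[~has_cancer]).mean()    # non-recalled fraction of cancer-free cases
    return sensitivity, specificity


def compare_settings(clinic_recalls, lab_recalls, has_cancer):
    # clinic_recalls / lab_recalls: one binary recall vector per reader,
    # all scored against the same case set (has_cancer gives the truth).
    clinic = np.array([sensitivity_specificity(r, has_cancer) for r in clinic_recalls])
    lab = np.array([sensitivity_specificity(r, has_cancer) for r in lab_recalls])

    # Group-averaged performance levels in each setting.
    print("clinic mean (sens, spec):", clinic.mean(axis=0))
    print("lab    mean (sens, spec):", lab.mean(axis=0))

    # Interreader dispersion of sensitivity, clinic vs. laboratory; Levene's
    # test is used here only as a generic stand-in for a test of equal spread.
    stat, p = stats.levene(clinic[:, 0], lab[:, 0])
    print(f"Levene test on sensitivities: W = {stat:.3f}, P = {p:.3f}")

For example, compare_settings could be called with nine readers' clinical and laboratory recall vectors over the same enriched case set to obtain group-averaged performance and a dispersion comparison of the kind summarized above.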
