Interpretive Performance and Inter-Observer Agreement on Digital Mammography Test Sets

Objective: To evaluate radiologists' interpretive performance and inter-observer agreement on digital mammography test sets and to investigate whether radiologist characteristics affect performance and agreement.

Materials and Methods: The test sets consisted of full-field digital mammograms and contained 12 cancer cases among 1000 total cases. Twelve radiologists independently interpreted all mammograms. Performance indicators included the recall rate, cancer detection rate (CDR), positive predictive value (PPV), sensitivity, specificity, false-positive rate (FPR), and area under the receiver operating characteristic curve (AUC). Inter-radiologist agreement was measured. The radiologist characteristics recorded were the number of years of experience interpreting mammography, fellowship training in breast imaging, and annual volume of mammography interpretation.

Results: The mean and range of interpretive performance were as follows: recall rate, 7.5% (3.3–10.2%); CDR, 10.6 per 1000 examinations (8.0–12.0); PPV, 15.9% (8.8–33.3%); sensitivity, 88.2% (66.7–100%); specificity, 93.5% (90.6–97.8%); FPR, 6.5% (2.2–9.4%); and AUC, 0.93 (0.82–0.99). Radiologists who interpreted more than 3000 screening mammograms annually tended to exhibit higher CDRs and sensitivities than those who interpreted fewer than 3000 mammograms (p = 0.064). Inter-radiologist agreement ranged from 77.2% to 88.8% for percent agreement and from 0.27 to 0.34 for the kappa value. Radiologist characteristics did not affect agreement.

Conclusion: The radiologists' interpretive performance fulfilled the mammography screening goal of the American College of Radiology, although there was inter-observer variability. Radiologists who interpreted more than 3000 screening mammograms annually tended to perform better than those who did not.

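The performance indicators and agreement statistics reported above are standard screening-mammography measures. The following is a minimal sketch, not the authors' analysis code, of how such metrics could be computed from each reader's binary recall decisions against the cancer ground truth. The reader arrays, the 1000-case/12-cancer layout, and the recall probabilities are illustrative assumptions; Cohen's kappa is taken from scikit-learn. AUC would additionally require ordinal scores (e.g., BI-RADS assessments) rather than binary recalls.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def screening_metrics(truth, recall):
    """truth, recall: 1-D arrays of 0/1 (cancer present, case recalled)."""
    truth, recall = np.asarray(truth), np.asarray(recall)
    tp = np.sum((truth == 1) & (recall == 1))
    fp = np.sum((truth == 0) & (recall == 1))
    tn = np.sum((truth == 0) & (recall == 0))
    fn = np.sum((truth == 1) & (recall == 0))
    n = truth.size
    return {
        "recall_rate": (tp + fp) / n,                  # proportion of cases recalled
        "CDR_per_1000": 1000 * tp / n,                 # cancers detected per 1000 exams
        "PPV": tp / (tp + fp) if tp + fp else np.nan,  # cancers among recalled cases
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "FPR": fp / (tn + fp),
    }

# Illustrative test set mirroring the study design: 1000 cases, 12 cancers.
rng = np.random.default_rng(0)
truth = np.zeros(1000, dtype=int)
truth[:12] = 1

# Two hypothetical readers with assumed recall behavior (not study data).
reader_a = np.where(truth == 1, 1, rng.random(1000) < 0.07).astype(int)
reader_b = np.where(truth == 1, rng.random(1000) < 0.9, rng.random(1000) < 0.06).astype(int)

print(screening_metrics(truth, reader_a))

# Pairwise inter-reader agreement: percent agreement and Cohen's kappa.
print("percent agreement:", np.mean(reader_a == reader_b))
print("kappa:", cohen_kappa_score(reader_a, reader_b))
```

In practice each of the 12 readers would be compared against every other reader, and the abstract's 77.2–88.8% agreement and 0.27–0.34 kappa ranges correspond to such pairwise (or multi-rater) summaries.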