External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets.

PURPOSE Lesions detected at mammography are described with a highly standardized terminology: the Breast Imaging-Reporting and Data System (BI-RADS) lexicon. To date, no validated semantic computer-assisted classification algorithm exists that interactively links combinations of morphological descriptors from the lexicon to a probabilistic risk estimate of malignancy. The authors therefore aim to externally validate the mammographic mass diagnosis (MMassDx) algorithm. To ultimately become accepted in clinical routine, a classification algorithm like MMassDx must perform well in a variety of clinical circumstances and in datasets that were not used to generate the algorithm.

METHODS The MMassDx algorithm uses a naïve Bayes network and calculates post-test probabilities of malignancy based on two distinct sets of variables: (a) BI-RADS descriptors and age ("descriptor model") and (b) BI-RADS descriptors, age, and BI-RADS assessment categories ("inclusive model"). The authors evaluate both the MMassDx (descriptor) and MMassDx (inclusive) models using two large publicly available datasets of mammographic mass lesions: the digital database for screening mammography (DDSM) dataset and the mammographic mass (MM) dataset. The DDSM dataset contains two subsets from the same examinations, a medio-lateral oblique (MLO) view dataset and a cranio-caudal (CC) view dataset. The DDSM contains 1220 mass lesions and the MM dataset contains 961 mass lesions. The authors evaluate discriminative performance using the area under the receiver-operating-characteristic curve (AUC) and compare it with the AUC of the BI-RADS assessment categories alone (i.e., the clinical performance) using the DeLong method. The authors also use calibration curves to evaluate whether the assigned probabilistic risk estimates reflect the lesions' true risk of malignancy.

RESULTS The authors demonstrate that the MMassDx algorithms show good discriminatory performance. AUC for the MMassDx (descriptor) model in the DDSM data is 0.876/0.895 (MLO/CC view) and AUC for the MMassDx (inclusive) model in the DDSM data is 0.891/0.900 (MLO/CC view). AUC for the MMassDx (descriptor) model in the MM data is 0.862 and AUC for the MMassDx (inclusive) model in the MM data is 0.900. In all scenarios, MMassDx performs significantly better than clinical performance (P < 0.05 in each case). The authors furthermore demonstrate that the MMassDx algorithm systematically underestimates the risk of malignancy in the DDSM and MM datasets, especially when low probabilities of malignancy are assigned.

CONCLUSIONS The authors' results reveal that the MMassDx algorithms have good discriminatory performance but less accurate calibration when tested on two independent validation datasets. Improving calibration and testing in a prospective clinical population will be important steps toward translating these algorithms to the clinic.
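The quantities named in the methods can be illustrated with a minimal sketch. It is not the MMassDx implementation: the function names, the likelihood ratios, and the toy scores and labels below are all hypothetical and chosen only for illustration. The sketch shows a naïve Bayes post-test probability in odds form, the AUC computed as a rank (Mann-Whitney) statistic, and a binned calibration table that compares mean predicted risk with the observed malignancy rate. The DeLong comparison of correlated AUCs is not covered here.

```python
from math import prod


def post_test_probability(prior, likelihood_ratios):
    """Naive Bayes in odds form: posterior odds = prior odds * product of
    per-descriptor likelihood ratios (descriptors assumed independent)."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * prod(likelihood_ratios)
    return post_odds / (1.0 + post_odds)


def auc(scores, labels):
    """AUC as the probability that a malignant lesion (label 1) receives a
    higher score than a benign one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_table(scores, labels, n_bins=5):
    """Group predictions into equal-width probability bins and return
    (mean predicted risk, observed malignancy rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        bins[idx].append((s, y))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(s for s, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            rows.append((mean_pred, observed, len(b)))
    return rows


# Hypothetical toy data: predicted malignancy risks and biopsy outcomes.
toy_scores = [0.05, 0.10, 0.15, 0.30, 0.45, 0.65, 0.85, 0.90]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]

toy_auc = auc(toy_scores, toy_labels)
toy_rows = calibration_table(toy_scores, toy_labels)
# Hypothetical prior of 0.2 and a single descriptor with likelihood ratio 4.
toy_posterior = post_test_probability(0.2, [4.0])
```

In a calibration table of this kind, a bin whose observed malignancy rate exceeds its mean predicted risk indicates underestimation, which is the pattern the abstract reports for low assigned probabilities.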
