Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density.

OBJECTIVE We developed deep learning algorithms to automatically assess BI-RADS breast density.

METHODS Using a large multi-institution patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial, we investigated the effect of data, model, and training parameters on overall model performance and obtained a crowdsourced evaluation from attendees of the ACR 2019 Annual Meeting.

RESULTS Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class κ of 0.667. When training used images sampled randomly from the data set, rather than an equal number of images from each density category, the model predictions were biased away from the low-prevalence categories such as extremely dense breasts. As a net result, equal-class sampling increased sensitivity and decreased specificity for predicting dense breasts compared with random sampling. We also found that model performance degraded when the model was evaluated on digital mammography data formats that differed from the one it was trained on, emphasizing the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, had higher agreement with our algorithm than with the original interpreting radiologists.

CONCLUSION We identified parameters that can influence model performance and showed how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.
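To make the reported metrics concrete, the following is a minimal sketch of how a four-class κ and the dense-versus-non-dense sensitivity and specificity could be computed from paired labels. It assumes unweighted Cohen's κ (the abstract does not say whether a weighted variant was used) and treats BI-RADS categories c and d as "dense". The function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter

# BI-RADS density categories; c (heterogeneously dense) and d (extremely
# dense) are grouped as "dense" for the binary metrics below.
CATEGORIES = ["a", "b", "c", "d"]
DENSE = {"c", "d"}


def cohen_kappa(rater_1, rater_2, categories=CATEGORIES):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(rater_1)
    # Observed agreement: fraction of cases where both raters match.
    observed = sum(x == y for x, y in zip(rater_1, rater_2)) / n
    # Chance agreement: product of the two raters' marginal frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def dense_sensitivity_specificity(reference, predicted):
    """Sensitivity and specificity for the binary dense / non-dense task."""
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        if ref in DENSE:
            if pred in DENSE:
                tp += 1
            else:
                fn += 1
        else:
            if pred in DENSE:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one dense and one non-dense case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only (not data from the study).
radiologist = ["a", "b", "b", "c", "c", "c", "d", "b"]
model = ["a", "b", "c", "c", "c", "b", "d", "b"]

kappa = cohen_kappa(radiologist, model)
sensitivity, specificity = dense_sensitivity_specificity(radiologist, model)
print(f"kappa={kappa:.3f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

A sampling scheme that pushes predictions toward categories c and d would be expected to raise the first value returned by `dense_sensitivity_specificity` and lower the second, which is the trade-off the results describe for equal-class sampling.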
