International evaluation of an AI system for breast cancer screening

Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful¹. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives². Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.

An artificial intelligence (AI) system performs as well as or better than radiologists at detecting breast cancer from mammograms, and using a combination of AI and human inputs could help to improve screening efficiency.
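To make the AUC-ROC comparison concrete, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen cancer case receives a higher score than a randomly chosen normal case (ties count as one half), and then takes the absolute margin between two scorers. It is an illustration only, not the evaluation code used in the study: the function name `auc_roc` and all scores are hypothetical, and the statistical comparison in the study itself (confidence intervals, multi-reader analysis) is not reproduced here.

```python
from typing import Sequence


def auc_roc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Estimate AUC-ROC by pairwise comparison of positive and negative scores.

    Each (positive, negative) pair contributes 1 if the positive case scores
    higher, 0.5 if the scores tie, and 0 otherwise.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical suspicion scores (illustrative values, not study data).
ai_cancer = [0.9, 0.8, 0.7, 0.4]
ai_normal = [0.1, 0.3, 0.5, 0.2]
reader_cancer = [0.6, 0.8, 0.3, 0.4]
reader_normal = [0.1, 0.5, 0.4, 0.2]

ai_auc = auc_roc(ai_cancer, ai_normal)
reader_auc = auc_roc(reader_cancer, reader_normal)

# Absolute margin: a plain difference of the two AUC values.
margin = ai_auc - reader_auc
print(f"AI AUC={ai_auc:.4f}, reader AUC={reader_auc:.4f}, margin={margin:.4f}")
```

An absolute margin is a plain difference of two AUC values, so a margin of 11.5% means the two areas differ by 0.115 on the 0 to 1 scale rather than by a ratio.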

[1] J. Hanley, et al. The meaning and use of the area under a receiver operating characteristic (ROC) curve, 1982, Radiology.

[2] E. DeLong, et al. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, 1988, Biometrics.

[3] Robert Tibshirani, et al. An Introduction to the Bootstrap, 1994.

[4] N. Obuchowski, et al. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: An anova approach with dependent observations, 1995.

[5] R. Warren, et al. Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms, 1996, BMJ.

[6] R. Swensson. Unified measurement of observer performance in detecting and localizing target objects on images, 1996, Medical physics.

[7] M. Aickin, et al. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods, 1996, American journal of public health.

[8] N A Obuchowski, et al. On the comparison of correlated proportions for clustered data, 1998, Statistics in medicine.

[9] C. Metz, et al. "Proper" Binormal ROC Curves: Theory and Maximum-Likelihood Estimation, 1999, Journal of mathematical psychology.

[10] D. Wolverton, et al. Performance parameters for screening and diagnostic mammography: specialist and general radiologists, 2002, Radiology.

[11] Huey-miin Hsueh, et al. Tests for equivalence or non‐inferiority for paired binary data, 2002, Statistics in medicine.

[12] Debra M Ikeda, et al. Computer-aided detection output on 172 subtle findings on normal mammograms previously obtained in women with breast cancer detected at follow-up screening mammography, 2004, Radiology.

[13] C. D'Orsi, et al. Diagnostic Performance of Digital Versus Film Mammography for Breast-Cancer Screening, 2005, The New England journal of medicine.

[14] C. D'Orsi, et al. Influence of computer-aided detection on performance of screening mammography, 2007, The New England journal of medicine.

[15] Wende Logan-Young, et al. Evaluation of computer-aided detection systems in the detection of small invasive breast carcinoma, 2007, Radiology.

[16] S. Hillis. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis, 2007, Statistics in medicine.

[17] David Gur, et al. Comparing areas under receiver operating characteristic curves: potential impact of the "Last" experimentally measured operating point, 2008, Radiology.

[18] Gengsheng Qin, et al. Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test, 2008, Statistical methods in medical research.

[19] S. Astley, et al. Single reading with computer-aided detection for screening mammography, 2008, The New England journal of medicine.

[20] M. Giger, et al. Anniversary paper: History and status of CAD and quantitative image analysis: the role of Medical Physics and AAPM, 2008, Medical physics.

[21] Y. Nakajima, et al. Radiologist supply and workload: international comparison, 2008, Radiation Medicine.

[22] Hong-Jun Yoon, et al. Operating characteristics predicted by models for diagnostic tasks involving lesion localization, 2008, Medical physics.

[23] Paul Wing, et al. Workforce shortages in breast imaging: impact on mammography utilization, 2009, AJR. American journal of roentgenology.

[24] J. Elmore, et al. Variability in interpretive performance at screening mammography and radiologists' characteristics associated with accuracy, 2009, Radiology.

[25] C. D'Orsi, et al. Breast cancer screening with imaging: recommendations from the Society of Breast Imaging and the ACR on the use of mammography, breast MRI, breast ultrasound, and other technologies for the detection of clinically occult breast cancer, 2010, Journal of the American College of Radiology : JACR.

[26] Nico Karssemeijer, et al. Using computer-aided detection in mammography as a decision support, 2010, European Radiology.

[27] Jonathan H Sunshine, et al. How widely is computer-aided detection used in screening and diagnostic mammography? 2010, Journal of the American College of Radiology : JACR.

[28] J. Hardin, et al. A note on the tests for clustered matched‐pair binary data, 2010, Biometrical journal. Biometrische Zeitschrift.

[29] L. Tabár, et al. Swedish two-county trial: impact of mammographic screening on breast cancer mortality during 3 decades, 2011, Radiology.

[30] Marcello Tonelli, et al. Recommendations on screening for breast cancer in average-risk women aged 40–74 years, 2011, Canadian Medical Association Journal.

[31] C. de Wolf, et al. Mammographic Screening Programmes in Europe: Organization, Coverage and Participation, 2012, Journal of medical screening.

[32] The Australian BreastScreen workforce: a snapshot, 2012.

[33] Kyle J Myers, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA, 2012, Academic radiology.

[34] Paul F Pinsky, et al. Enriched designs for assessing discriminatory performance — analysis of bias and variance, 2012, Statistics in medicine.

[35] D G Altman, et al. The benefits and harms of breast cancer screening: an independent review, 2013, British Journal of Cancer.

[36] Petter Laake, et al. Recommended tests and confidence intervals for paired binomial proportions, 2014, Statistics in medicine.

[37] E. Pisano, et al. Consequences of false-positive screening mammograms, 2014, JAMA internal medicine.

[38] J. Lortet-Tieulent, et al. Breast Cancer Screening for Women at Average Risk: 2015 Guideline Update From the American Cancer Society, 2015, JAMA.

[39] C. Lehman, et al. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection, 2015, JAMA internal medicine.

[40] Douglas G Altman, et al. Inverse probability weighting, 2016, British Medical Journal.

[41] Subhashini Venugopalan, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs, 2016, JAMA.

[42] T. Wilt, et al. Screening for Breast Cancer: U.S. Preventive Services Task Force Recommendation Statement, 2011.

[43] N. Houssami, et al. The epidemiology, radiology and biological characteristics of interval breast cancers in population mammography screening, 2017, npj Breast Cancer.

[44] C. Lehman, et al. National Performance Benchmarks for Modern Screening Digital Mammography: Update from the Breast Cancer Surveillance Consortium, 2017, Radiology.

[45] Dev P. Chakraborty, et al. Observer Performance Methods for Diagnostic Imaging: Foundations, Modeling, and Applications with R-Based Examples, 2017.

[46] Thomas Frauenfelder, et al. Deep Learning in Mammography: Diagnostic Accuracy of a Multipurpose Image Analysis Software in the Detection of Breast Cancer, 2017, Investigative radiology.

[47] Abi Rimmer, et al. Radiologist shortage leaves patient care at risk, warns royal college, 2017, British Medical Journal.

[48] Sebastian Thrun, et al. Dermatologist-level classification of skin cancer with deep neural networks, 2017, Nature.

[49] S. Jha, et al. Why CAD Failed in Mammography, 2018, Journal of the American College of Radiology : JACR.

[50] István Csabai, et al. Detecting and classifying lesions in mammograms with Deep Learning, 2017, Scientific Reports.

[51] A. Jemal, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, 2018, CA: a cancer journal for clinicians.

[52] Geraint Rees, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease, 2018, Nature Medicine.

[53] Marcus A. Badgeley, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study, 2018, PLoS medicine.

[54] G. Corrado, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography, 2019, Nature Medicine.

[55] T. Helbich, et al. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists, 2019, Journal of the National Cancer Institute.

[56] Eric J Topol, et al. High-performance medicine: the convergence of human and artificial intelligence, 2019, Nature Medicine.

[57] Nan Wu, et al. Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening, 2019, IEEE Transactions on Medical Imaging.