Monte Carlo dropout increases model repeatability

The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of such robustness. Much attention is given to classification performance without assessing model repeatability, leading to models that prove unusable in practice. In this work, we evaluate the repeatability of four model types on images from the same patient acquired during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on three medical image analysis tasks: cervical cancer screening, breast density estimation, and retinopathy of prematurity classification. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability on all tasks for the binary, multi-class, and ordinal models, reducing the 95% limits of agreement by 17 percentage points on average.
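
To make the test-time procedure concrete, the sketch below shows one common way to sample Monte Carlo dropout predictions in PyTorch: dropout layers are kept active during inference, and the softmax outputs of several stochastic forward passes are averaged. This is a minimal illustration assuming a classification model with standard dropout layers; the function name, sample count, and softmax averaging are our assumptions for illustration, not details taken from the paper.

```python
import torch


def mc_dropout_predict(model, x, num_samples=20):
    """Average softmax predictions over stochastic forward passes.

    Keeping dropout active at test time means each pass samples a
    different sub-network; averaging the resulting predictions reduces
    the variance across repeated images of the same patient.
    (Illustrative sketch; not the authors' implementation.)
    """
    model.eval()  # keep batch norm, etc. in evaluation mode
    # Re-enable only the dropout layers (extend to Dropout2d if used).
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

    with torch.no_grad():
        # Stack num_samples stochastic predictions, then average them.
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )
    return probs.mean(dim=0)  # averaged class probabilities
```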
