Can non-specialists provide high quality gold standard labels in challenging modalities?

Probably yes. — Supervised Deep Learning dominates performance scores for many computer vision tasks and defines the state-of-the-art. However, medical image analysis lags behind natural image applications. One of the many reasons is the lack of well-annotated medical image data available to researchers. One of the first things researchers are told is that significant expertise is required to reliably and accurately interpret and label such data, and we see significant inter- and intra-observer variability between expert annotations of medical images. Still, it is a widely held assumption that novice annotators are unable to provide useful annotations for use by clinical Deep Learning models. In this work we challenge this assumption and examine the implications of using a minimally trained novice labelling workforce to acquire annotations for a complex medical image dataset. We study the time and cost implications of using novice annotators, the raw performance of novice annotators compared to gold-standard expert annotators, and the downstream effects on a trained Deep Learning segmentation model’s performance for detecting a specific congenital heart disease (hypoplastic left heart syndrome) in fetal ultrasound imaging.
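
To make the annotator-agreement comparison concrete, the following is a minimal sketch of how novice annotations could be scored against expert gold-standard masks using the Dice overlap coefficient, a common segmentation agreement metric. The metric choice, function name, and mask shapes are illustrative assumptions and not details taken from the paper.

import numpy as np

def dice_score(mask_a, mask_b):
    """Dice overlap between two binary segmentation masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total

# Hypothetical usage: random masks stand in for an expert and a novice annotation.
rng = np.random.default_rng(0)
expert_mask = rng.integers(0, 2, size=(128, 128))
novice_mask = rng.integers(0, 2, size=(128, 128))
print(f"Dice(novice, expert) = {dice_score(novice_mask, expert_mask):.3f}")

Averaging such per-image scores over an annotated set gives a simple way to compare novice annotators against the expert reference, and the same metric can be reused to evaluate the downstream segmentation model.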
