Processing multi-expert annotations in digital pathology: a study of the Gleason 2019 challenge

Deep learning algorithms rely on large numbers of annotations for training and testing. In digital pathology, a ground truth is rarely available, and many tasks show large inter-expert disagreement. Using the Gleason2019 dataset, we analyse how the choices made in deriving a ground truth from multiple experts may affect the results and the conclusions that can be drawn from challenges and benchmarks. We show that using undocumented consensus methods, as is often done, reduces our ability to properly analyse challenge results. We also show that taking each expert’s annotations into account enriches the discussion of results and is more in line with the clinical reality and complexity of the application.
