Who Said What: Modeling Individual Labelers Improves Classification

Data are often labeled by many different experts, with each expert labeling only a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make use of potentially valuable information about which expert produced which label. To exploit this extra information, we propose modeling the experts individually and then learning averaging weights for combining their predictions, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and to take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method outperforms the competing algorithms of Welinder and Perona (2010) and of Mnih and Hinton (2012). Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.
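The core idea of the abstract can be sketched on toy data: give each expert its own prediction head over shared inputs, then learn normalized weights for averaging the per-expert predictions, so more reliable experts receive more weight. Everything below is an illustrative assumption, not the paper's actual architecture (which is a deep network trained end to end on retinal images): the simulated experts, the per-expert logistic heads, and the squared-error fit of the averaging weights against the majority vote are all stand-ins chosen to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 simulated experts of differing reliability
# label the same binary task. Sizes and reliabilities are illustrative.
n, d, n_experts = 500, 5, 3
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)

reliability = [0.95, 0.80, 0.60]  # expert 0 is the most reliable
expert_labels = np.stack(
    [np.where(rng.random(n) < r, y, 1 - y) for r in reliability],
    axis=1,
)  # shape (n, n_experts)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: model each expert individually -- one logistic head per expert,
# trained on that expert's own labels by gradient descent.
heads = np.zeros((n_experts, d))
for e in range(n_experts):
    for _ in range(300):
        p = sigmoid(X @ heads[e])
        heads[e] -= 0.1 * X.T @ (p - expert_labels[:, e]) / n

# Step 2: learn softmax-normalized averaging weights over the expert heads.
# Here the weights are fit to the majority vote with a squared-error loss,
# standing in for the paper's jointly trained combination.
P = sigmoid(X @ heads.T)  # per-expert predicted probabilities, (n, n_experts)
majority = (expert_labels.mean(axis=1) > 0.5).astype(float)
logits_w = np.zeros(n_experts)
for _ in range(500):
    w = np.exp(logits_w) / np.exp(logits_w).sum()
    p = P @ w                         # combined prediction
    grad_p = 2.0 * (p - majority) / n # d(squared error)/dp
    grad_w = P.T @ grad_p
    # backprop through the softmax normalization
    grad_logits = w * (grad_w - w @ grad_w)
    logits_w -= 5.0 * grad_logits

w = np.exp(logits_w) / np.exp(logits_w).sum()
print(np.round(w, 2))  # weights typically favor the more reliable experts
```

The softmax parameterization keeps the averaging weights positive and summing to one; making the weights a function of the input instead of a single vector would give the sample-specific weighting mentioned in the abstract.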

[1] A. P. Dawid, et al. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm, 1979.

[2] S. Walter, et al. Estimating the error rates of diagnostic tests, 1980, Biometrics.

[3] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[4] Pietro Perona, et al. Inferring Ground Truth from Subjective Labelling of Venus Images, 1994, NIPS.

[5] J. Elmore, et al. Variability in radiologists' interpretations of mammograms, 1994, The New England Journal of Medicine.

[6] P. Bossuyt, et al. Evaluation of diagnostic tests when there is no gold standard: a review of methods, 2007, Health Technology Assessment.

[7] Gerardo Hermosillo, et al. Supervised learning from multiple experts: whom to trust when everyone lies a bit, 2009, ICML.

[8] Javier R. Movellan, et al. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise, 2009, NIPS.

[9] Pietro Perona, et al. Online crowdsourcing: Rating annotators and obtaining cost-effective labels, 2010, IEEE CVPR Workshops.

[10] S. Haneda, et al. [International clinical diabetic retinopathy disease severity scale], 2010, Nihon Rinsho (Japanese Journal of Clinical Medicine).

[11] Pietro Perona, et al. The Multidimensional Wisdom of Crowds, 2010, NIPS.

[12] Gerardo Hermosillo, et al. Learning From Crowds, 2010, J. Mach. Learn. Res.

[13] Michael I. Jordan, et al. Bayesian Bias Mitigation for Crowdsourcing, 2011, NIPS.

[14] Jennifer G. Dy, et al. Active Learning from Crowds, 2011, ICML.

[15] Devavrat Shah, et al. Iterative Learning for Reliable Crowdsourcing Systems, 2011, NIPS.

[16] Jian Peng, et al. Variational Inference for Crowdsourcing, 2012, NIPS.

[17] Shipeng Yu, et al. Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks, 2012, J. Mach. Learn. Res.

[18] Geoffrey E. Hinton, et al. Learning to Label Aerial Images from Noisy Data, 2012, ICML.

[19] Xi Chen, et al. Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing, 2013, ICML.

[20] G. Quellec, et al. Automated analysis of retinal images for detection of referable diabetic retinopathy, 2013, JAMA Ophthalmology.

[21] Bernardete Ribeiro, et al. Gaussian Process Classification and Active Learning with Multiple Annotators, 2014, ICML.

[22] Guy Cazuguel, et al. Feedback on a Publicly Distributed Image Database: The Messidor Database, 2014.

[23] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[24] Yee Whye Teh, et al. Bayesian nonparametric crowdsourcing, 2014, J. Mach. Learn. Res.

[25] Rob Fergus, et al. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture, 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[26] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[27] Chaithanya A. Ramachandra, et al. EyeArt: Automated, High-throughput, Image Analysis for Diabetic Retinopathy Screening, 2015.

[28] J. Elmore, et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens, 2015, JAMA.

[29] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[30] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Christopher De Sa, et al. Data Programming: Creating Large Training Sets, Quickly, 2016, NIPS.

[32] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[33] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[34] Subhashini Venugopalan, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs, 2016, JAMA.

[36] Nassir Navab, et al. AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images, 2016, IEEE Trans. Medical Imaging.

[37] Geoffrey E. Hinton, et al. Regularizing Neural Networks by Penalizing Confident Output Distributions, 2017, ICLR.

[38] International Diabetes Federation. IDF Diabetes Atlas, 9th Edition, 2019.