Revisiting Machine Learning from Crowds: A Mixture Model for Grouping Annotations

Today, supervised learning is widely used for pattern recognition, computer vision and other tasks. In this setting, data must be explicitly annotated. Unfortunately, obtaining accurate labels can be difficult, expensive and time-consuming. As a result, many machine learning projects rely on labeling processes that involve crowds, i.e. multiple subjective and inexpert annotators. Handling the resulting label noise in a principled way is an important challenge for machine learning, known as learning from crowds. In this paper, we present a model that learns patterns of label noise by grouping annotations. In contrast to prior work, we do not model a specific labeling pattern for each annotator but instead explain the data with a fixed-size mixture model. This approach handles a sparse distribution of labels among annotators and yields a model with fewer parameters that can scale better to large-scale scenarios. Experiments on real and simulated data illustrate the advantages of our approach.
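To make the idea concrete, the following is a minimal sketch of the kind of fixed-size mixture the abstract describes; the notation (number of groups M, mixing weights \pi_m, group-level confusion matrices \beta^{(m)}, latent true label z_i of instance i, observed annotation y_{ij} from annotator j) is introduced here for illustration and need not match the paper's exact formulation:

\[
p\bigl(y_{ij}=l \mid z_i=k\bigr) \;=\; \sum_{m=1}^{M} \pi_m\, \beta^{(m)}_{kl},
\qquad \sum_{m=1}^{M} \pi_m = 1 .
\]

Under this reading, each annotation is explained by one of M shared noise patterns rather than by an annotator-specific confusion matrix, so for K classes the parameter count stays on the order of M·K^2 instead of growing with the size of the crowd; this is what makes sparse, large-scale annotation settings tractable. Parameters of such a mixture are typically estimated with an EM-style procedure.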
