Learning From Crowds

For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, annotators often disagree substantially, so it is of great practical interest to address conventional supervised learning problems in this setting. In this paper we describe a probabilistic approach to supervised learning when multiple annotators provide (possibly noisy) labels but there is no absolute gold standard. The proposed algorithm evaluates the different experts and also estimates the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline.
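To make the idea of evaluating annotators while estimating hidden labels concrete, the following is a minimal sketch of a generic latent-label EM procedure, shown next to majority voting. It is not the algorithm of this paper, only a simplified illustration under stated assumptions: labels are binary, every annotator labels every item, and annotators err independently given the true label. The function names, the simulated accuracies, and the per-annotator sensitivity/specificity parameterization are all choices made for this example.

```python
import numpy as np

def majority_vote(labels):
    """Fraction of annotators voting 1 for each item (soft majority vote)."""
    return np.asarray(labels, dtype=float).mean(axis=1)

def em_annotators(labels, n_iter=50, eps=1e-6):
    """Illustrative EM for binary labels with no gold standard.

    labels: (n_items, n_annotators) array of 0/1 votes.
    Returns (mu, sens, spec): the posterior P(true label = 1) per item and
    the estimated sensitivity and specificity of each annotator.
    """
    y = np.asarray(labels, dtype=float)
    mu = majority_vote(y)  # initialize the posterior with the vote fraction
    for _ in range(n_iter):
        # M-step: re-estimate prevalence and annotator quality from soft labels.
        prev = np.clip(mu.mean(), eps, 1 - eps)
        sens = (mu[:, None] * y).sum(axis=0) / (mu.sum() + eps)
        spec = ((1 - mu)[:, None] * (1 - y)).sum(axis=0) / ((1 - mu).sum() + eps)
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # E-step: posterior of the hidden label given all annotator votes.
        log_pos = np.log(prev) + (y * np.log(sens) + (1 - y) * np.log(1 - sens)).sum(axis=1)
        log_neg = np.log(1 - prev) + ((1 - y) * np.log(spec) + y * np.log(1 - spec)).sum(axis=1)
        mu = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return mu, sens, spec

# Simulated example (assumed accuracies; not data from the paper).
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=2000)
accuracy = np.array([0.95, 0.90, 0.90, 0.60, 0.55])
correct = rng.random((truth.size, accuracy.size)) < accuracy
votes = np.where(correct, truth[:, None], 1 - truth[:, None])

mu, sens, spec = em_annotators(votes)
print("majority vote accuracy:", ((majority_vote(votes) > 0.5) == truth).mean())
print("EM accuracy:           ", ((mu > 0.5) == truth).mean())
print("estimated sensitivity: ", np.round(sens, 2))
print("estimated specificity: ", np.round(spec, 2))
```

Majority voting weights every annotator equally, whereas the EM sketch weights each vote by that annotator's estimated reliability, which is why the two can differ when annotator quality varies.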
