Debiasing Crowdsourced Batches

Crowdsourcing is the de facto standard for gathering annotated data. In theory, workers are assumed to judge each data item independently; in practice, items are often grouped into batches that a worker views and annotates together, which saves the time and cost of repeating instructions and background material. As a result, even though annotations on items within the same batch are usually treated as independent, a worker's judgment on one item can be influenced by the other items in the batch, introducing additional errors into the collected labels. In this paper, we study the annotation bias that arises when data items are presented to workers in batches. We propose a novel worker model that characterizes annotating behavior on batches, and show how to train this model on annotation data sets. We also present a debiasing technique that prevents this bias from degrading the accuracy of the resulting labels. Experiments on synthetic and real-world data sets show that our method achieves up to a 57% improvement in F1-score over the standard majority-voting baseline.
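
For reference, the majority-voting baseline against which the paper reports its F1 improvement can be sketched as below. The paper's own worker model and debiasing procedure are not described in this abstract, so only the baseline is shown; the function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """Aggregate per-item labels by simple majority vote.

    `annotations` is an iterable of (item_id, worker_id, label) triples;
    returns a dict mapping each item_id to its most frequent label.
    Ties are broken arbitrarily by Counter.most_common.
    """
    votes = defaultdict(Counter)
    for item_id, _worker_id, label in annotations:
        votes[item_id][label] += 1
    return {item_id: counts.most_common(1)[0][0]
            for item_id, counts in votes.items()}

# Example: three workers label two items; item "b" is contested.
annotations = [
    ("a", "w1", 1), ("a", "w2", 1), ("a", "w3", 0),
    ("b", "w1", 0), ("b", "w2", 1), ("b", "w3", 0),
]
print(majority_vote(annotations))  # {'a': 1, 'b': 0}
```

This baseline treats every worker and every annotation as independent and equally reliable, which is exactly the assumption the paper argues breaks down when items are judged together in batches.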
