Quality Management in Crowdsourcing using Gold Judges Behavior

Crowdsourcing relevance labels has become an accepted practice for the evaluation of IR systems, where the task of constructing a test collection is distributed over large populations of unknown workers with widely varying skills and motivations. A typical method to check and ensure the quality of the crowd's output is to inject work tasks with known answers (gold tasks) against which workers' performance can be measured. However, gold tasks are expensive to create and have limited applicability. A more recent trend is to monitor workers' interactions during a task and estimate their work quality from their behavior. In this paper, we show that without gold behavior signals that reflect trusted interaction patterns, such classifiers can perform poorly, especially on complex tasks, which can lead to high-quality crowd workers being blocked while poorly performing workers remain undetected. Through a series of crowdsourcing experiments, we compare the behaviors of trained professional judges and crowd workers, and then use the trained judges' behavior signals as gold behavior to train a classifier that detects poorly performing crowd workers. Our experiments show that classification accuracy almost doubles on some tasks when gold behavior data is used.
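
To make the approach concrete, the following is a minimal sketch of a behavior-based worker-quality classifier, not the paper's actual implementation: sessions from trained professional judges are treated as gold behavior (positive class), sessions from known poorly performing workers as the negative class, and a gradient boosting classifier is fit on per-session interaction features. The feature set (dwell time, clicks, scrolls) and the synthetic data generator are illustrative assumptions only.

```python
# Sketch only: gold-behavior classifier on assumed interaction features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def behavior_features(n_sessions, mean_dwell, mean_clicks, mean_scrolls):
    """Simulate per-session interaction signals (illustrative only)."""
    return np.column_stack([
        rng.normal(mean_dwell, 5.0, n_sessions),   # seconds spent per task
        rng.poisson(mean_clicks, n_sessions),      # click events
        rng.poisson(mean_scrolls, n_sessions),     # scroll events
    ])

# Gold behavior: sessions of trained professional judges (positive class).
X_gold = behavior_features(200, mean_dwell=40, mean_clicks=12, mean_scrolls=8)
# Sessions of workers known to perform poorly (negative class).
X_poor = behavior_features(200, mean_dwell=8, mean_clicks=2, mean_scrolls=1)

X = np.vstack([X_gold, X_poor])
y = np.concatenate([np.ones(len(X_gold)), np.zeros(len(X_poor))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")

# At run time, new crowd sessions would be scored the same way; sessions whose
# behavior diverges from gold behavior could be flagged for review rather than
# blocked outright.
```

In practice the positive/negative labeling, feature extraction, and choice of classifier would depend on the platform's logging and task design; the point of the sketch is only that trusted judges' interaction logs supply the training signal that gold tasks would otherwise provide.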
