An analysis of human factors and label accuracy in crowdsourcing relevance judgments

Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the scalability issue that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks, since the work is carried out by a self-forming group of individuals, the crowd, motivated by different incentives and completing the tasks with varying levels of attention and success. This increases the need for careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon’s Mechanical Turk, conducted to explore the ‘human’ characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task, and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured based on agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, their interest in and familiarity with the topic, the perceived task difficulty, and their satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from a peculiarity of the crowdsourcing environment: the sample of individuals engaged under specific work conditions is inherently influenced by those conditions themselves.
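As a minimal illustration of the quality measure described above, the sketch below computes the accuracy of crowdsourced relevance labels against a gold set, both overall and per worker; the latter is the kind of worker-level signal that can be correlated with self-reported factors such as interest or perceived difficulty. The data layout, field order, and example records are hypothetical and not the paper's actual pipeline.

# A minimal sketch (assumed data layout, not the paper's pipeline):
# measure label quality as agreement with a gold set of relevance judgments.

from collections import defaultdict

# (hit_id, worker_id, doc_id, label) tuples collected from HIT assignments
crowd_labels = [
    ("hit1", "w1", "d1", "relevant"),
    ("hit1", "w2", "d1", "non-relevant"),
    ("hit1", "w3", "d1", "relevant"),
]

# gold relevance judgments keyed by document id
gold = {"d1": "relevant"}

def accuracy_against_gold(labels, gold):
    """Fraction of crowd labels that match the gold judgment for their document."""
    judged = [(doc, lab) for _, _, doc, lab in labels if doc in gold]
    if not judged:
        return 0.0
    correct = sum(1 for doc, lab in judged if lab == gold[doc])
    return correct / len(judged)

def per_worker_accuracy(labels, gold):
    """Accuracy computed separately for each worker."""
    by_worker = defaultdict(list)
    for _, worker, doc, lab in labels:
        if doc in gold:
            by_worker[worker].append(lab == gold[doc])
    return {w: sum(hits) / len(hits) for w, hits in by_worker.items()}

print(accuracy_against_gold(crowd_labels, gold))  # overall agreement with gold
print(per_worker_accuracy(crowd_labels, gold))    # worker-level accuracy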
