In Search of Quality in Crowdsourcing for Search Engine Evaluation

Crowdsourcing is increasingly looked upon as a feasible alternative to traditional methods of gathering relevance labels for the evaluation of search engines, offering a solution to the scalability problem that hinders traditional approaches. However, crowdsourcing raises a range of questions about the quality of the resulting data. What can be said about the quality of data contributed by anonymous workers who are paid only cents for their efforts? Can higher pay guarantee better quality? Do better-qualified workers produce higher-quality labels? In this paper, we investigate these and similar questions through a series of controlled crowdsourcing experiments in which we vary pay, required effort, and worker qualifications, and observe their effects on the resulting label quality, measured as agreement with a gold set.
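
Since label quality here is defined as agreement with a gold set, the sketch below illustrates two common ways such agreement could be computed: raw accuracy and Cohen's kappa. This is a minimal illustration only; the abstract does not name the specific agreement statistic used, and the label data and variable names are hypothetical assumptions, not the paper's actual experimental setup.

```python
from collections import Counter

def accuracy(labels, gold):
    """Fraction of documents where the worker label matches the gold label."""
    assert len(labels) == len(gold)
    return sum(l == g for l, g in zip(labels, gold)) / len(gold)

def cohens_kappa(labels, gold):
    """Chance-corrected agreement between worker labels and gold labels."""
    n = len(gold)
    observed = accuracy(labels, gold)
    # Expected agreement if the two label distributions were independent.
    label_counts = Counter(labels)
    gold_counts = Counter(gold)
    classes = set(labels) | set(gold)
    expected = sum(label_counts[c] * gold_counts[c] for c in classes) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical binary relevance labels: 1 = relevant, 0 = not relevant.
gold   = [1, 0, 1, 1, 0, 1, 0, 0]
worker = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"accuracy = {accuracy(worker, gold):.2f}")      # 0.75
print(f"kappa    = {cohens_kappa(worker, gold):.2f}")  # 0.50
```

Kappa is often preferred over raw accuracy for this kind of comparison because it discounts the agreement that would be expected by chance given the label distributions.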
