Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le | Andrew Edmonds | Vaughn Hester | Lukas Biewald