On the Evaluation of the Quality of Relevance Assessments Collected through Crowdsourcing

Established methods for evaluating information retrieval systems rely upon test collections that comprise document corpora, search topics, and relevance assessments. Building large test collections is, however, an expensive and increasingly challenging process. In particular, building a collection with a sufficient quantity and quality of relevance assessments is a major challenge. With the growing size of document corpora, it is inevitable that relevance assessments become increasingly incomplete, diminishing the value of the test collections. Recent initiatives aim to address this issue through crowdsourcing. Such techniques harness the problem-solving power of large groups of people who are compensated for their efforts monetarily, through community recognition, or through an entertaining experience. However, the diverse backgrounds of the assessors and the incentives of the crowdsourcing models directly influence the trustworthiness and the quality of the resulting data. Currently, there are no established methods for measuring the quality of the collected relevance assessments. In this paper, we discuss the components that could be used to devise such measures. Our recommendations are based on experiments with collecting relevance assessments for digitized books, conducted as part of the INEX Book Track in 2008.

Keywords: collection construction, relevance judgments, incentives, social game, quality assessment.