A Discriminative Approach to Predicting Assessor Accuracy

Modeling how an individual relevance assessor's performance changes over time offers new ways to improve the quality of relevance judgments, such as by dynamically routing judging tasks to the assessors most likely to produce reliable labels. Whereas prior assessor models have typically adopted a single generative approach, we formulate a flexible, discriminative, feature-based model. This lets us combine multiple generative models and integrate additional behavioral evidence, better adapting to temporal variance in assessor accuracy. Experiments on crowd assessor data from the NIST TREC 2011 Crowdsourcing Track show that our model improves prediction accuracy by 26-36% across assessors, enabling relevance judgments of 29-47% higher quality to be collected at 17-45% lower cost.
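
To make the discriminative formulation concrete, the sketch below is a minimal, hypothetical illustration rather than the paper's actual implementation: a logistic-regression classifier over hand-picked features for each upcoming judgment, combining a generative-style running agreement rate with simple behavioral signals (e.g., time spent on the previous document) to predict whether an assessor's next label will be correct. All feature names and data values here are assumptions made for illustration.

    # Minimal sketch (hypothetical features and data): discriminative
    # prediction of whether an assessor's next relevance label is correct.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row describes one upcoming judgment by an assessor:
    # [running_agreement_rate, seconds_on_previous_doc, judgments_so_far]
    X_train = np.array([
        [0.92, 35.0, 120],
        [0.55,  4.0,  15],
        [0.80, 22.0,  60],
        [0.40,  3.0,   8],
    ])
    # 1 = the next label matched the gold/consensus judgment, 0 = it did not.
    y_train = np.array([1, 0, 1, 0])

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Estimate the probability that a new assessor's next label is correct;
    # such scores could drive routing of judging tasks to reliable assessors.
    p_correct = model.predict_proba([[0.75, 18.0, 40]])[:, 1]
    print(f"Estimated probability the next label is correct: {p_correct[0]:.2f}")

In this framing, each generative assessor model contributes its prediction as one more feature, so the discriminative layer can weigh competing models against behavioral evidence instead of committing to a single generative assumption.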
