Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experiments. In the classic setting, generating relevance judgments involves human assessors and is a costly and time-consuming task. Researchers and practitioners therefore continue to face the challenge of performing reliable, low-cost evaluations of retrieval systems. Crowdsourcing, a novel method of data acquisition, is now widely used across many research fields, and it has been shown to be an inexpensive, quick, and reliable alternative for creating relevance judgments. One application of crowdsourcing in IR is judging the relevance of query-document pairs. For a crowdsourcing experiment to succeed, the relevance judgment tasks must be designed carefully, with an emphasis on quality control. This paper explores the factors that influence the accuracy of relevance judgments produced by crowd workers and examines how the reliability of judgments in crowdsourcing experiments can be improved.
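To make the notions of aggregation and reliability concrete, the following minimal Python sketch illustrates two quality-control steps commonly discussed in this setting: aggregating redundant worker labels by majority vote and measuring chance-corrected agreement against a small gold-standard set with Cohen's kappa. The data structures and label values are hypothetical and are not taken from the paper.

```python
# Minimal sketch (hypothetical data): majority-vote aggregation of crowd labels
# and Cohen's kappa against editorial (gold) judgments.
from collections import Counter


def majority_vote(labels):
    """Aggregate redundant worker labels for one query-document pair
    (e.g., 0 = non-relevant, 1 = relevant); ties resolve arbitrarily."""
    return Counter(labels).most_common(1)[0][0]


def cohens_kappa(gold, judged):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(gold)
    p_o = sum(g == j for g, j in zip(gold, judged)) / n  # observed agreement
    gold_counts, judged_counts = Counter(gold), Counter(judged)
    # expected agreement by chance, summed over label categories
    p_e = sum((gold_counts[c] / n) * (judged_counts[c] / n)
              for c in set(gold) | set(judged))
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


# Toy example: three workers judge each of four query-document pairs.
worker_labels = {("q1", "d3"): [1, 1, 0],
                 ("q1", "d7"): [0, 0, 0],
                 ("q2", "d1"): [1, 0, 1],
                 ("q2", "d4"): [1, 1, 1]}
aggregated = {pair: majority_vote(votes) for pair, votes in worker_labels.items()}

# Compare the aggregated crowd labels with gold judgments for the same pairs.
gold = {("q1", "d3"): 1, ("q1", "d7"): 0, ("q2", "d1"): 0, ("q2", "d4"): 1}
pairs = sorted(gold)
print(aggregated)
print("kappa:", cohens_kappa([gold[p] for p in pairs],
                             [aggregated[p] for p in pairs]))
```

In practice, such agreement measures are computed over a set of gold-labeled query-document pairs embedded in the task to monitor worker quality; the sketch above only illustrates the arithmetic involved.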
