Enhancing reliability using peer consistency evaluation in human computation

Peer consistency evaluation is often used in games with a purpose (GWAP) to evaluate workers against the outputs of other workers rather than against gold standard answers. Despite its popularity, its reliability has never been systematically tested to establish whether it can serve as a general evaluation method in human computation systems. We present experimental results showing that human computation systems using peer consistency evaluation can produce outcomes that are even better than those of systems that evaluate workers against gold standard answers. We also show that, even without any evaluation, simply telling workers that their answers will be used as future evaluation standards can significantly improve their performance. These results have important implications for methods that improve the reliability of human computation systems.
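To make the contrast between the two evaluation schemes concrete, the following is a minimal sketch in Python of peer consistency scoring versus gold standard scoring. The data layout, function names, and example answers are hypothetical illustrations, not the authors' implementation: a worker is scored by agreement with the majority answer of the other workers on each task, instead of by agreement with a known gold label.

```python
# Hypothetical sketch: peer consistency vs. gold standard worker evaluation.
from collections import Counter


def peer_consistency_score(worker_id, answers):
    """Fraction of tasks on which `worker_id` agrees with the peer majority.

    `answers` maps task_id -> {worker_id: answer}. No gold labels are used.
    """
    agreements, scored = 0, 0
    for task_answers in answers.values():
        if worker_id not in task_answers:
            continue
        peers = [a for w, a in task_answers.items() if w != worker_id]
        if not peers:
            continue
        majority, _ = Counter(peers).most_common(1)[0]
        agreements += task_answers[worker_id] == majority
        scored += 1
    return agreements / scored if scored else 0.0


def gold_standard_score(worker_id, answers, gold):
    """Fraction of gold-labeled tasks answered correctly by `worker_id`."""
    hits = [answers[t][worker_id] == g
            for t, g in gold.items()
            if worker_id in answers.get(t, {})]
    return sum(hits) / len(hits) if hits else 0.0


if __name__ == "__main__":
    # Toy example with three workers labeling two images.
    answers = {
        "img1": {"w1": "cat", "w2": "cat", "w3": "dog"},
        "img2": {"w1": "dog", "w2": "dog", "w3": "dog"},
    }
    gold = {"img1": "cat", "img2": "dog"}
    print(peer_consistency_score("w3", answers))    # 0.5: agrees with peers on img2 only
    print(gold_standard_score("w3", answers, gold))  # 0.5: correct on img2 only
```

In this sketch the two scores happen to coincide; the paper's point is that scoring (or merely announcing scoring) against peer outputs can motivate workers as well as, or better than, scoring against gold answers.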
