Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

In the past half-decade, Amazon Mechanical Turk has radically changed the way many scholars do research. The availability of a massive, distributed, anonymous crowd of individuals willing to perform general human-intelligence micro-tasks for micro-payments is a valuable resource for researchers and practitioners. This paper addresses the challenges of obtaining quality annotations for subjective, judgment-oriented tasks of varying difficulty. We design and conduct a large, controlled experiment (N=68,000) to measure the efficacy of selected strategies for obtaining high-quality data annotations from non-experts. Our results point to the advantages of person-oriented strategies over process-oriented strategies. Specifically, we find that screening workers for requisite cognitive aptitudes and providing training in qualitative coding techniques are quite effective, significantly outperforming control and baseline conditions. Interestingly, such strategies can improve coder annotation accuracy above and beyond common benchmark strategies such as Bayesian Truth Serum (BTS).
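For readers unfamiliar with the BTS benchmark referenced above, the sketch below illustrates its scoring rule (Prelec, 2004): each respondent supplies an answer plus a predicted population distribution, and is rewarded for answers that turn out to be "surprisingly common" relative to the crowd's collective prediction, plus a weighted term for forecasting accuracy. This is a minimal illustrative implementation assuming categorical items; the function name, array shapes, and smoothing constant are our own choices, not the paper's code.

```python
import numpy as np

def bts_scores(answers, predictions, alpha=1.0):
    """Bayesian Truth Serum scores for categorical items (Prelec, 2004).

    answers:     (R,) int array, each respondent's chosen option index in 0..K-1
    predictions: (R, K) array, each respondent's predicted population
                 distribution over the K options (rows sum to 1)
    alpha:       weight on the prediction score (alpha > 0)
    Returns an (R,) array: information score + alpha * prediction score.
    """
    R, K = predictions.shape
    eps = 1e-9  # smoothing to avoid log(0)

    # Empirical endorsement frequencies x̄_k
    x_bar = np.bincount(answers, minlength=K) / R

    # Geometric mean of predicted frequencies ȳ_k
    y_bar = np.exp(np.log(predictions + eps).mean(axis=0))

    # Information score: log(x̄_k / ȳ_k) for each respondent's chosen answer,
    # i.e., a reward for answers that are more common than collectively predicted
    info = np.log((x_bar[answers] + eps) / (y_bar[answers] + eps))

    # Prediction score: -KL(x̄ || y^r), rewarding accurate population forecasts
    pred = np.sum(x_bar * (np.log(predictions + eps) - np.log(x_bar + eps)), axis=1)

    return info + alpha * pred
```

In a crowdsourcing setting, such scores are typically used to rank or bonus workers; higher BTS scores are taken as a proxy for truthful, informative responding, which is the sense in which BTS serves as a process-oriented baseline here.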
