Exploring Crowd Consistency in a Mechanical Turk Survey

Crowdsourcing can provide a platform for evaluating software engineering research. In this paper, we explore characteristics of the worker population on Amazon's Mechanical Turk, a popular microtask crowdsourcing environment, and measure the percentage of workers who are potentially qualified to perform software- or computer-science-related tasks. Through a baseline survey and two replications, we measure workers' answer consistency as well as the consistency of sample characteristics. In total, we deployed 1,200 surveys, which were completed by 1,064 unique workers. Our results show that 24% of the study participants have a computer science or IT background and that most workers are payment-driven when choosing tasks. Sample characteristics can vary significantly, even across large samples of 300 participants. We also frequently observed inconsistency in the answers of workers who completed two surveys: approximately 30% answered at least one question inconsistently across their two submissions. These findings imply a need for replication and quality controls in crowdsourced experiments.
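To illustrate the kind of consistency measure described above, a minimal sketch in Python is given below. This is not code from the study; the data layout, worker IDs, and question keys are assumptions made for illustration. It flags repeat workers whose answers differ on at least one shared question between their two survey submissions.

    # Minimal sketch (assumed data layout): flag repeat workers whose answers
    # differ on at least one shared question between two survey runs.
    def inconsistent_workers(first_run, second_run):
        """first_run, second_run: dict mapping worker_id -> {question_id: answer}.
        Returns the set of worker IDs with at least one differing answer."""
        repeat_workers = first_run.keys() & second_run.keys()
        flagged = set()
        for worker in repeat_workers:
            a, b = first_run[worker], second_run[worker]
            shared_questions = a.keys() & b.keys()
            if any(a[q] != b[q] for q in shared_questions):
                flagged.add(worker)
        return flagged

    # Example: worker "w2" changes one answer between runs, so 1 of 2 repeat
    # workers is flagged as inconsistent.
    run1 = {"w1": {"q1": "yes", "q2": "CS"}, "w2": {"q1": "no", "q2": "IT"}}
    run2 = {"w1": {"q1": "yes", "q2": "CS"}, "w2": {"q1": "yes", "q2": "IT"}}
    flagged = inconsistent_workers(run1, run2)
    print(f"{len(flagged)}/{len(run1.keys() & run2.keys())} repeat workers inconsistent")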
