Challenges and strategies for running controlled crowdsourcing experiments

This paper reports on the challenges and lessons we learned while running controlled experiments on crowdsourcing platforms. Crowdsourcing is becoming an attractive technique for engaging a large and diverse pool of subjects in experimental research, allowing researchers to achieve levels of scale and completion times that would not be feasible in lab settings. However, this scale and flexibility come at the cost of multiple, and sometimes unknown, sources of bias and confounding factors that arise from technical limitations of crowdsourcing platforms and from the challenges of running controlled experiments in the "wild". In this paper, we take our experience in running systematic evaluations of task design as a motivating example to explore, describe, and quantify the potential impact of running uncontrolled crowdsourcing experiments and to derive possible coping strategies. The challenges we identify include sampling bias, controlling the assignment of subjects to experimental conditions, learning effects, and the reliability of crowdsourcing results. According to our empirical studies, the impact of potential biases and confounding factors can amount to a 38% loss in the utility of the data collected in uncontrolled settings, and it can significantly change the outcome of experiments. These issues ultimately inspired us to implement CrowdHub, a system that sits on top of major crowdsourcing platforms and allows researchers and practitioners to run controlled crowdsourcing projects.
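The abstract does not detail how CrowdHub enforces this control, so the following is only a minimal sketch of what "controlling the assignment of subjects to experimental conditions" can mean in practice: routing each worker into the least-filled between-subjects condition while rejecting repeat participants (a simple guard against learning effects and cross-condition contamination). All names here (assign_condition, the in-memory registries, the condition labels) are hypothetical illustrations, not part of CrowdHub or of any crowdsourcing platform API.

import random

# Hypothetical in-memory registry used only for illustration.
seen_workers = set()                   # worker IDs that already participated in any condition
condition_counts = {"A": 0, "B": 0}    # number of subjects assigned to each condition


def assign_condition(worker_id: str):
    """Assign a new worker to the least-filled condition; reject repeat workers.

    Returns the condition name, or None if the worker already took part.
    """
    if worker_id in seen_workers:
        return None
    seen_workers.add(worker_id)
    # Balanced assignment keeps group sizes comparable; ties are broken
    # at random to avoid a systematic ordering bias.
    least = min(condition_counts.values())
    candidates = [c for c, n in condition_counts.items() if n == least]
    condition = random.choice(candidates)
    condition_counts[condition] += 1
    return condition


if __name__ == "__main__":
    for wid in ["w1", "w2", "w1", "w3"]:
        print(wid, "->", assign_condition(wid))   # "w1" is rejected the second time

In a real deployment this bookkeeping would have to live server-side and persist across task batches, since crowdsourcing platforms by themselves do not prevent the same worker from entering different conditions of an experiment.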
