Validity and Mechanical Turk: An assessment of exclusion methods and interactive experiments

Social science researchers increasingly recruit participants through Amazon's Mechanical Turk (MTurk) platform. Yet the physical isolation of MTurk participants and the perceived lack of experimental control have led to persistent concerns about the quality of the data that can be obtained from MTurk samples. In this paper we focus on two of the most salient concerns: that MTurk participants may not buy into interactive experiments, and that they may produce unreliable or invalid data. We review existing research on these topics and present new data to address these concerns. We find that insufficient attention is no more a problem among MTurk samples than among other commonly used convenience or high-quality commercial samples, and that MTurk participants buy into interactive experiments and trust researchers as much as participants in laboratory studies. Furthermore, we find that employing rigorous exclusion methods consistently boosts statistical power without introducing problematic side effects (e.g., substantially biasing the post-exclusion sample), and can thus provide a general solution for dealing with problematic respondents across samples. We conclude with a discussion of best practices and recommendations.

Online participant recruitment has led to persistent concerns about data quality. Online participants are just as attentive as participants recruited offline. Online participants buy into experimental social interactions as much as in the lab. Rigorous exclusion methods can be used to improve data quality online and offline.
