Crowdsourcing a Normative Natural Language Dataset: A Comparison of Amazon Mechanical Turk and In-Lab Data Collection

Background: Crowdsourcing has become a valuable method for collecting medical research data. This approach, which recruits participants through open calls on the Web, is particularly useful for assembling large normative datasets. However, it is not known how natural language datasets collected over the Web differ from those collected under controlled laboratory conditions.

Objective: To compare the natural language responses obtained from a crowdsourced sample of participants with responses collected in a conventional laboratory setting from participants recruited according to specific age and gender criteria.

Methods: We collected natural language descriptions of 200 half-minute movie clips from Amazon Mechanical Turk workers (crowdsourced) and from 60 participants recruited from the community (lab-sourced). Crowdsourced participants responded to as many clips as they wanted and typed their responses; lab-sourced participants gave spoken responses to 40 clips, which were transcribed. The content of the responses was evaluated with a take-one-out procedure that compared each response to the other responses for the same clip and to responses for other clips, using the average number of shared words (a minimal sketch of this scoring follows the abstract).

Results: In contrast to the 13 months of recruiting required to collect normative data from 60 lab-sourced participants (with specific demographic characteristics), only 34 days were needed to collect normative data from 99 crowdsourced participants (contributing a median of 22 responses each). The majority of crowdsourced workers were female, and their median age was 35 years, lower than the lab-sourced median of 62 years but similar to the median age of the US population. Responses from crowdsourced participants were longer on average (33 vs 28 words, P<.001) and used a less varied vocabulary. However, the words used to describe a particular clip were strongly similar between the two datasets, as shown by a cross-dataset count of shared words (P<.001). Within both datasets, responses contained substantial relevant content, sharing more words with responses to the same clip than with responses to other clips (P<.001). There was evidence that responses from female and older crowdsourced participants had more shared words (P=.004 and P=.01, respectively), whereas in the lab-sourced population, younger participants had more shared words (P=.01).

Conclusions: Crowdsourcing is an effective approach for quickly and economically collecting a large, reliable dataset of normative natural language responses.
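The take-one-out scoring can be illustrated with a short sketch. The following Python is a minimal illustration only, not the authors' implementation: the data layout (a dict mapping clip IDs to lists of tokenized responses) and the use of unique-word overlap as the matching rule are assumptions made for the example; the study's actual tokenization and word-matching details may differ.

    from statistics import mean

    def shared_words(a, b):
        # Count the unique words appearing in both responses (an assumed
        # matching rule; the study may also handle stemming or stop words).
        return len(set(a) & set(b))

    def take_one_out_scores(responses):
        # For each response, compute (1) the mean shared-word count against
        # the remaining responses to the same clip and (2) the mean
        # shared-word count against responses to all other clips.
        same_clip_means, other_clip_means = [], []
        for clip, resp_list in responses.items():
            # Responses to every other clip, for the between-clip comparison.
            foreign = [r for c, rl in responses.items() if c != clip for r in rl]
            for i, resp in enumerate(resp_list):
                others = resp_list[:i] + resp_list[i + 1:]  # leave one out
                if others:
                    same_clip_means.append(
                        mean(shared_words(resp, o) for o in others))
                if foreign:
                    other_clip_means.append(
                        mean(shared_words(resp, f) for f in foreign))
        return same_clip_means, other_clip_means

    # Hypothetical usage with two clips and tokenized responses:
    responses = {
        "clip_01": [["a", "man", "walks", "his", "dog"],
                    ["man", "walking", "a", "dog", "outside"]],
        "clip_02": [["two", "cars", "race", "downhill"],
                    ["cars", "racing", "very", "fast"]],
    }
    same, other = take_one_out_scores(responses)

Under this scoring, relevant content appears as same-clip means exceeding other-clip means, which is the pattern the Results section reports for both datasets.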
