Statistical considerations for crowdsourced perceptual ratings of human speech productions

ABSTRACT Crowdsourcing has become a major tool for scholarly research since its introduction to the academic sphere in 2008. However, unlike in traditional laboratory settings, it is nearly impossible to control the conditions under which workers on crowdsourcing platforms complete tasks. In the study of communication disorders, crowdsourcing has provided a novel solution to the collection of perceptual ratings of human speech production. Such ratings allow researchers to gauge whether a treatment yields meaningful change in how human listeners perceive disordered speech. This paper explores some statistical considerations for crowdsourced data, with a specific focus on collecting perceptual ratings of human speech productions. Random effects models are applied to crowdsourced perceptual ratings collected on both continuous and binary scales. A simulation study is conducted to test the reliability of the proposed models under differing numbers of workers and tasks. Finally, this methodology is applied to a data set from the study of communication disorders.
