Increasing the Reliability of Crowdsourcing Evaluations Using Online Quality Assessment

Manual annotations and transcriptions are of ever-increasing importance in areas such as behavioral signal processing, image processing, computer vision, and speech signal processing. Conventionally, this metadata has been collected by expert annotators. With the advent of crowdsourcing services, the scientific community has begun to crowdsource many tasks that are tedious for researchers but can be easily completed by many human annotators. While crowdsourcing is a cheaper and more efficient approach, the quality of the annotations often becomes a limitation. This paper investigates the use of reference sets with predetermined ground truth to monitor annotators' accuracy and fatigue in real time. The reference set contains evaluations identical in form to the target questions, so annotators are blind to whether their performance on a given question is being graded. We explore these ideas on the emotional annotation of the MSP-IMPROV database. We present promising results suggesting that our system is suitable for collecting accurate annotations.
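The core mechanism described above can be illustrated with a minimal sketch: gold-standard (reference) items with known answers are mixed invisibly into the task stream, and each annotator's rolling accuracy on those items is tracked online so that unreliable or fatigued workers can be flagged. All class and parameter names below are hypothetical illustrations, not the paper's actual implementation.

```python
from collections import deque


class GoldStandardMonitor:
    """Hypothetical sketch of online crowdsourcing quality assessment:
    reference items with known answers are interleaved with real items,
    and each annotator's recent accuracy on them is tracked in real time."""

    def __init__(self, window=10, threshold=0.7):
        self.window = window        # number of recent gold items to consider
        self.threshold = threshold  # minimum acceptable rolling accuracy
        self.history = {}           # annotator id -> deque of 0/1 outcomes

    def record(self, annotator, answer, gold_answer):
        """Score one gold item; the annotator never knows it was graded."""
        outcomes = self.history.setdefault(
            annotator, deque(maxlen=self.window))
        outcomes.append(1 if answer == gold_answer else 0)

    def accuracy(self, annotator):
        """Rolling accuracy over the most recent gold items, or None."""
        outcomes = self.history.get(annotator)
        if not outcomes:
            return None
        return sum(outcomes) / len(outcomes)

    def is_reliable(self, annotator):
        """Flag annotators whose rolling accuracy drops below threshold,
        e.g. due to spamming or fatigue over a long session."""
        acc = self.accuracy(annotator)
        return acc is None or acc >= self.threshold
```

A rolling window (rather than lifetime accuracy) matters here: it lets the monitor catch annotators whose performance degrades mid-session from fatigue, which a cumulative average would mask.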
