Introducing the Weighted Trustability Evaluator for Crowdsourcing Exemplified by Speaker Likability Classification

Crowdsourcing is an emerging collaborative approach applicable, among many other fields, to language and speech processing. Indeed, crowdsourcing has already been applied in speech processing with promising results. However, only a few studies have investigated its use in computational paralinguistics. In this contribution, we propose a novel evaluator for crowdsourced ratings, termed the Weighted Trustability Evaluator (WTE), which is computed from each rater's consistency over a set of test questions. We further investigate the reliability of crowdsourced annotations compared to those obtained with traditional labelling procedures, such as constrained listening experiments in laboratories or other controlled environments. This comparison includes an in-depth analysis of the obtainable classification performance. The experiments were conducted on the Speaker Likability Database (SLD) already used in the INTERSPEECH 2012 Speaker Trait Challenge, and the results lend further weight to the assumption that crowdsourcing can serve as a reliable annotation source for computational paralinguistics, given a sufficient number of raters and suitable measures of their reliability.
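
The abstract describes the WTE only at a high level: raters are weighted by their consistency on embedded test questions. As a minimal illustrative sketch, not the paper's exact formulation, the following Python snippet assumes that each rater's weight is their agreement rate with gold-standard answers to the test questions, and that item labels are then aggregated by a weighted vote; the function names and the binary-label setup are hypothetical.

    import numpy as np

    def wte_weights(test_answers, gold_labels):
        # Hypothetical trustability weight: a rater's agreement rate with
        # the known (gold) answers to the embedded test questions.
        # test_answers: (n_raters, n_tests) array of 0/1 answers
        # gold_labels:  (n_tests,) array of 0/1 gold answers
        return (np.asarray(test_answers) == np.asarray(gold_labels)).mean(axis=1)

    def weighted_label(ratings, weights):
        # Aggregate one item's binary ratings (e.g. likable = 1) by
        # trustability-weighted vote.
        return int(np.average(np.asarray(ratings, dtype=float), weights=weights) >= 0.5)

    # Three raters, two test questions each, one item to label:
    w = wte_weights([[1, 0], [1, 1], [0, 0]], [1, 1])  # -> [0.5, 1.0, 0.0]
    print(weighted_label([0, 1, 0], w))                # most trusted rater prevails -> 1

In this reading, a rater who fails every test question receives weight zero and is effectively excluded from the aggregated label, which matches the abstract's goal of reliability-weighted annotation.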
