CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset

People convey their emotional state through their face and voice. We present an audio-visual dataset uniquely suited for the study of multi-modal emotion expression and perception. The dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were rated by multiple raters in three modalities: audio, visual, and audio-visual. Categorical emotion labels and real-valued intensity levels for the perceived emotion were collected through crowd-sourcing from 2,443 raters. Human recognition accuracy for the intended emotion is 40.9% for audio-only, 58.2% for visual-only, and 63.6% for audio-visual presentation. Recognition rates are highest for neutral, followed by happy, anger, disgust, fear, and sad. Average emotion intensity is rated highest under visual-only perception. Accurate recognition of disgust and fear requires simultaneous audio-visual cues, whereas anger and happiness are recognized well from a single modality. The large dataset we introduce can be used to probe further questions concerning the audio-visual perception of emotion.
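As a rough illustration of how such per-modality recognition rates could be computed from crowd-sourced ratings, the sketch below assumes a hypothetical CSV file (ratings.csv) with modality, intended_emotion, and perceived_emotion columns; this is not the dataset's actual distribution format, only one plausible layout.

```python
# Minimal sketch: per-modality human recognition accuracy from crowd-sourced
# ratings. File name and column names are hypothetical, not the dataset's
# actual schema.
import csv
from collections import defaultdict

def recognition_rates(path):
    """Return {modality: fraction of ratings matching the intended emotion}."""
    correct = defaultdict(int)  # modality -> ratings matching the intent
    total = defaultdict(int)    # modality -> all ratings for that modality
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            modality = row["modality"]  # e.g. "audio", "visual", "audio-visual"
            total[modality] += 1
            if row["perceived_emotion"] == row["intended_emotion"]:
                correct[modality] += 1
    return {m: correct[m] / total[m] for m in total}

if __name__ == "__main__":
    for modality, rate in sorted(recognition_rates("ratings.csv").items()):
        print(f"{modality}: {100 * rate:.1f}%")
```

With ratings aggregated this way, the three per-modality accuracies reported above (40.9%, 58.2%, and 63.6%) would fall out directly as the printed fractions.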
