IEMOCAP: interactive emotional dyadic motion capture database

Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 hours of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
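As a concrete illustration of how utterance-level annotations from such a corpus might be organized, the following is a minimal Python sketch. The record fields (session_id, speaker_id, scenario, emotion, transcript, markers) and the example values are illustrative assumptions based only on the description above, not the database's actual release format.

from dataclasses import dataclass
from typing import List, Tuple

# The five target categories named in the abstract.
EMOTIONS = {"happiness", "anger", "sadness", "frustration", "neutral"}

@dataclass
class Utterance:
    """One spoken turn from a dyadic recording session (hypothetical schema)."""
    session_id: int    # index of the dyadic session
    speaker_id: str    # which of the ten actors produced the turn (assumed naming)
    scenario: str      # "scripted" or "improvised" elicitation setting
    emotion: str       # one of the five target categories
    transcript: str    # text of the spoken utterance
    markers: List[Tuple[float, float, float]]  # per-frame 3-D position of one face/head/hand marker

    def __post_init__(self) -> None:
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unexpected emotion label: {self.emotion}")
        if self.scenario not in {"scripted", "improvised"}:
            raise ValueError(f"unexpected scenario type: {self.scenario}")

# Usage with made-up values:
u = Utterance(
    session_id=1,
    speaker_id="Ses01_F",
    scenario="improvised",
    emotion="frustration",
    transcript="I have been waiting here for hours.",
    markers=[(0.0, 0.0, 0.0), (0.1, 0.02, 0.0)],
)
print(u.emotion, len(u.markers))

A per-utterance record of this kind is what makes it straightforward to slice the corpus by elicitation setting (scripted vs. improvised) or by emotion category when studying the audio and motion-capture streams jointly.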
