A 3-D Audio-Visual Corpus of Affective Communication

Communication between humans relies deeply on the ability to express and recognize feelings. Research on human-machine interaction therefore needs to address the recognition and simulation of emotional states, a prerequisite of which is the collection of affective corpora. Currently available datasets remain a bottleneck because of the difficulties that arise when acquiring and labeling affective data. In this work, we present a new audio-visual corpus covering what are arguably the two most important modalities humans use to communicate their emotional states: speech and facial expression, the latter captured as dense dynamic 3-D face geometries. We acquire high-quality data in a controlled environment and use video clips to induce affective states. The annotation of the speech signal includes a transcription of the corpus text into a phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation. We employ a real-time 3-D scanner to acquire dense dynamic facial geometries and track the faces throughout the sequences, achieving full spatial and temporal correspondence. The corpus is a valuable tool for applications such as affective visual speech synthesis and view-independent facial expression recognition.
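To make the speech-annotation step concrete, the following is a minimal sketch of how the fundamental frequency (F0) and signal intensity contours mentioned above could be extracted from a single recorded utterance. It assumes the librosa library and a hypothetical file name "utterance.wav"; the paper does not state which tools were used, so this illustrates the kind of processing involved rather than the authors' actual pipeline.

```python
# Sketch: extract F0 and intensity contours from one corpus utterance.
# Assumes librosa is installed; "utterance.wav" is a hypothetical file name.
import numpy as np
import librosa

# Load the recording at its native sampling rate.
y, sr = librosa.load("utterance.wav", sr=None)

hop = 256  # analysis hop in samples

# Fundamental frequency via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz, below typical male speech F0
    fmax=librosa.note_to_hz("C6"),   # ~1047 Hz, comfortably above speech F0
    sr=sr,
    hop_length=hop,
)

# Signal intensity as frame-wise RMS energy, expressed in dB.
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
intensity_db = 20.0 * np.log10(rms + 1e-10)

# Common time axis for both contours.
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)

for t, f, i in zip(times, f0, intensity_db):
    label = f"{f:.1f} Hz" if not np.isnan(f) else "unvoiced"
    print(f"{t:6.3f} s  F0: {label:>10}  intensity: {i:6.1f} dB")
```

In a corpus annotation workflow, per-frame values like these would then be time-aligned with the phone segmentation to yield the prosodic annotation of each utterance.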
