Acquisition of a 3D audio-visual corpus of affective speech

Communication between humans deeply relies on our capability of experiencing, expressing, and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, a prerequisite of which is the collection of affective corpora. Currently available datasets still represent a bottleneck because of the difficulties that arise during the acquisition and labeling of authentic affective data. In this work, we present a new audio-visual corpus covering what are possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression, the latter in the form of dense dynamic 3D face geometries. We also introduce an acquisition setup that labels the data with very little manual effort. We acquire high-quality data by working in a controlled environment and use video clips to induce affective states. To obtain the physical prosodic parameters of each utterance, the annotation process includes transcription of the corpus text into a phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals. We employ a real-time 3D scanner to record dense dynamic facial geometries and track the faces throughout the sequences, achieving full spatial and temporal correspondence. The corpus is relevant not only for affective visual speech synthesis and view-independent facial expression recognition, but also for studying the correlations between audio and facial features in the context of emotional speech.
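
As a concrete illustration of the last two steps of the annotation pipeline, the sketch below shows how fundamental frequency and signal intensity contours can be estimated from a recorded utterance. It is a minimal example using the open-source librosa library, not the toolchain used for the corpus; the file name, pitch range, and frame parameters are illustrative assumptions only.

```python
# Minimal sketch: F0 and intensity contours for one utterance.
# Assumes an utterance stored as "utterance.wav"; parameters are illustrative.
import librosa
import numpy as np

# Load the utterance at its native sampling rate.
y, sr = librosa.load("utterance.wav", sr=None)

# Fundamental frequency contour via the pYIN estimator
# (pitch range chosen to cover typical adult speech).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr, fmin=65.0, fmax=400.0, frame_length=2048, hop_length=256
)

# Signal intensity contour as frame-wise RMS energy, converted to dB.
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=256)[0]
intensity_db = librosa.amplitude_to_db(rms, ref=np.max)

# Shared time axis for both contours.
times = librosa.times_like(f0, sr=sr, hop_length=256)

for t, f, i in zip(times, f0, intensity_db):
    f_hz = 0.0 if np.isnan(f) else f  # unvoiced frames reported as 0 Hz
    print(f"{t:6.3f} s  F0 = {f_hz:6.1f} Hz  intensity = {i:6.1f} dB")
```

Aligning such frame-level contours with the phone segmentation would then yield per-phone prosodic parameters of the kind described above.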
