Human Perception of Audio-Visual Synthetic Character Emotion Expression in the Presence of Ambiguous and Conflicting Information

Computer-simulated avatars and humanoid robots have an increasingly prominent place in today's world. Acceptance of these synthetic characters depends on their ability to properly and recognizably convey basic emotional states to a user population. This study presents an analysis of the interaction between emotional audio (human voice) and video (simple animation) cues. The emotional relevance of each channel is analyzed both through its effect on human perception and through the extracted audio-visual features that contribute most prominently to that perception. Because of the unequal expressivity of the two channels, the audio was shown to bias the evaluators' perception. However, even in the presence of this strong audio bias, the video data were shown to affect human perception. The feature sets extracted from emotionally matched audio-visual displays contained both audio and video features, while the feature sets resulting from emotionally mismatched audio-visual displays contained only audio information. This result indicates that observers integrate natural audio cues and synthetic video cues only when the information expressed by the two channels is congruent. It is therefore important to design the presentation of audio-visual cues carefully, as a poor design may cause observers to ignore the information conveyed in one of the channels.
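As a concrete, hypothetical illustration of the feature analysis described above, the sketch below ranks a combined set of per-clip audio features (e.g., pitch and energy statistics) and video features (e.g., animation parameters) by how well each predicts mean evaluator ratings. The feature names, the synthetic data, and the use of scikit-learn's univariate selection are assumptions made for demonstration only; they do not reproduce the study's actual feature set or selection method.

```python
# Minimal sketch (not the study's implementation): rank hypothetical
# audio and video features by how strongly each predicts the mean
# human emotion rating assigned to a clip.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Hypothetical per-clip features: audio statistics and facial
# animation parameters. All names are placeholders.
feature_names = [
    "pitch_mean", "pitch_range", "energy_mean", "speech_rate",  # audio
    "eyebrow_height", "mouth_openness", "head_tilt",            # video
]
X = rng.normal(size=(60, len(feature_names)))  # 60 clips x 7 features
y = rng.normal(size=60)                        # mean evaluator rating per clip

# Keep the k features with the strongest univariate relationship to
# the ratings; comparing the selected sets across matched and
# mismatched conditions mirrors the analysis reported above.
selector = SelectKBest(score_func=f_regression, k=4).fit(X, y)
for name, kept, score in zip(feature_names, selector.get_support(), selector.scores_):
    print(f"{name:16s} score={score:6.2f} selected={kept}")
```

Under the study's findings, applying such a selection separately to matched and mismatched stimuli would retain features from both channels in the matched condition but only audio features in the mismatched condition.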
