Recognition of emotions from video using acoustic and facial features

In this paper, acoustic and facial features extracted from video are explored for recognizing emotions. The temporal variation of the gray values of pixels within the eye and mouth regions is used as a feature to capture emotion-specific information from facial expressions. Acoustic features representing spectral and prosodic information are explored for recognizing emotions from the speech signal. Autoassociative neural network (AANN) models are used to capture the emotion-specific information present in the acoustic and facial features. The basic objective of this work is to examine how well the proposed acoustic and facial features capture emotion-specific information. Further, the correlation between the two feature sets is analyzed by combining their evidence at different levels. The recognition performance of the systems developed using acoustic and facial features is observed to be 85.71% and 88.14%, respectively, and combining the evidence of the models built on the two feature sets improves the recognition performance to 93.62%. The performance of the AANN-based systems is compared with that of hidden Markov models, Gaussian mixture models, and support vector machines. The proposed features and models are evaluated on a real-life emotional database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, collected at the University of Southern California.
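
To make the approach concrete, the sketch below illustrates one plausible realization of the pipeline: frame-to-frame variation of mean gray values within fixed eye and mouth boxes as the facial feature, one autoassociative network per emotion (modeled here as a small autoencoder in PyTorch), classification by reconstruction error, and weighted score-level fusion of the acoustic and facial evidence. The region coordinates, layer sizes, training schedule, and fusion weight are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of the pipeline described in the abstract. Region boxes,
# network sizes, and the fusion weight are hypothetical placeholders.
import numpy as np
import torch
import torch.nn as nn

def facial_features(frames, regions=((10, 30, 20, 60), (60, 80, 25, 55))):
    """frames: (T, H, W) grayscale video; regions: (top, bottom, left, right)
    boxes for the eye and mouth areas (assumed, fixed coordinates).
    Returns one vector per clip (assumes a fixed number of frames T)."""
    feats = []
    for top, bot, left, right in regions:
        patch = frames[:, top:bot, left:right].reshape(len(frames), -1)
        feats.append(np.diff(patch.mean(axis=1)))  # frame-to-frame gray change
    return np.concatenate(feats)

class AANN(nn.Module):
    """Autoassociative network: an autoencoder trained to reconstruct
    feature vectors of a single emotion; low error signals that emotion."""
    def __init__(self, dim, hidden=20, bottleneck=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def train_aann(model, X, epochs=200, lr=1e-3):
    """X: (N, dim) tensor of feature vectors from one emotion class."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), X)  # reconstruct the input
        loss.backward()
        opt.step()

def confidence(model, x):
    """Map the reconstruction error of one test vector to a score in (0, 1]."""
    with torch.no_grad():
        err = nn.functional.mse_loss(model(x), x).item()
    return float(np.exp(-err))

def fuse(acoustic_conf, facial_conf, w=0.5):
    """Score-level fusion: weighted sum of per-emotion confidences
    from the acoustic and facial models (weight w is an assumption)."""
    return {e: w * acoustic_conf[e] + (1 - w) * facial_conf[e]
            for e in acoustic_conf}
```

In this formulation, one AANN is trained per emotion and per modality; a test vector is assigned to the emotion whose model reconstructs it with the lowest error, and the exponential mapping turns that error into a confidence that the fusion step can combine across modalities.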
