Emotion Recognition from Audio and Visual Data using F-score based Fusion

Emotion recognition has been one of the cornerstones of human-computer interaction. Although decades of work have attacked the problem of automatic emotion recognition from either audio or video signals alone, the fusion of the two modalities is more recent. In this paper, we tackle the problem when both audio and video data are available in a synchronized manner. We address the six basic human emotions, namely anger, disgust, fear, happiness, sadness, and surprise. We employ an automatic face tracker to extract facial points of interest from a video, and then compute feature vectors for each video frame using distances and angles between the tracked points. For audio data, we use pitch, energy, and MFCCs to derive feature vectors for each window as well as for the entire audio signal. We use two standard techniques, GMM-based HMM and SVM, as the base classifiers. We then design a novel fusion method using the F-scores of the base classifiers. We first demonstrate that our fusion approach can increase the accuracy of the base classifiers by as much as 5%. Finally, we show that our fusion-based bi-modal emotion recognition method achieves an overall accuracy of 54% on a publicly available database, a 9% improvement over the current state of the art.
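One plausible reading of the F-score-based fusion described above is a late-fusion rule in which each base classifier's per-class posteriors are weighted by that classifier's per-class F-score measured on held-out data. The sketch below illustrates this idea only; the function names, the confusion-matrix-based F1 computation, and the exact weighting rule are assumptions for illustration, not the paper's definitive implementation.

```python
import numpy as np

def per_class_f1(cm):
    """Per-class F1 from a confusion matrix (rows = true, cols = predicted),
    computed on a held-out validation set for each base classifier."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

def f_score_fusion(audio_probs, video_probs, f1_audio, f1_video):
    """Weight each modality's class posteriors by that modality's
    validation F1 for the class, then pick the highest fused score.

    audio_probs, video_probs : (n_samples, n_classes) posteriors
    f1_audio, f1_video       : (n_classes,) per-class F-scores
    """
    fused = f1_audio * audio_probs + f1_video * video_probs
    return fused.argmax(axis=-1)

# Toy example with two classes: the video classifier is more reliable
# (higher validation F1), so its vote dominates the fused decision.
audio_probs = np.array([[0.6, 0.4]])
video_probs = np.array([[0.3, 0.7]])
f1_audio = np.array([0.4, 0.4])
f1_video = np.array([0.8, 0.8])
print(f_score_fusion(audio_probs, video_probs, f1_audio, f1_video))  # [1]
```

Under this weighting, a modality that is known (from validation) to be unreliable for a given emotion contributes less to that emotion's fused score, which is one way a fusion scheme can outperform either base classifier alone.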
