Human emotion recognition from videos using spatio-temporal and audio features

In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion dataset with different subjects of both genders. Mel-frequency cepstral coefficients (MFCCs) and prosodic features are first identified and then extracted from the emotional speech. For facial expressions, spatio-temporal features are extracted from the visual streams. Principal component analysis (PCA) is applied to reduce the dimensionality of the visual features while retaining 97% of the variance. Codebooks are constructed for both the audio and visual features using Euclidean distance. Histograms of codeword occurrences are then fed to a state-of-the-art SVM classifier to obtain a judgment for each modality. The judgments from the individual classifiers are finally combined using the Bayes sum rule (BSR) as the decision step. The proposed system is evaluated on a public dataset for recognizing human emotions. Experimental results show that using visual features alone yields 74.15% accuracy on average, while using audio features alone gives an average recognition accuracy of 67.39%, whereas combining both audio and visual features improves the overall system accuracy significantly, up to 80.27%.
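The pipeline summarized above (PCA retaining 97% of the variance, Euclidean codebook construction, occurrence histograms per sample, and sum-rule fusion of the per-modality classifier scores) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names, the use of k-means for the codebook, the codebook size, and the toy data are all assumptions, and a real system would use an SVM (e.g. LIBSVM) rather than raw score fusion alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_97(X):
    # Center the data and project onto the leading principal
    # components that together explain 97% of the variance.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), 0.97)) + 1
    return Xc @ Vt[:k].T

def build_codebook(descriptors, n_words=8, iters=10):
    # Codebook in Euclidean space via plain k-means (Lloyd's algorithm);
    # the paper specifies Euclidean distance, k-means is an assumption.
    centers = descriptors[rng.choice(len(descriptors), n_words, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_words):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    # Assign each descriptor to its nearest codeword and build a
    # normalized histogram of codeword occurrences (the SVM input).
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def bayes_sum_rule(p_audio, p_visual):
    # Fuse the two modalities' per-class posteriors by summing them
    # and taking the argmax (the sum-rule decision step).
    return np.argmax(p_audio + p_visual, axis=1)
```

As a usage illustration, if the audio classifier outputs class posteriors `[0.6, 0.4]` and the visual classifier outputs `[0.3, 0.7]` for the same clip, the summed scores are `[0.9, 1.1]` and the fused decision is class 1, even though the two modalities disagree individually.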
