Audio-Visual Emotion Recognition in Video Clips

This paper presents a multimodal emotion recognition system based on the analysis of audio and visual cues. From the audio channel, Mel-frequency cepstral coefficients (MFCCs), filter bank energies, and prosodic features are extracted. For the visual part, two strategies are considered. First, geometric relations between facial landmarks, i.e., distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames, and a convolutional neural network is trained on these key-frames to visually discriminate between the emotions. Finally, the confidence outputs of all classifiers from all modalities are used to define a new feature space, which is learned for final emotion label prediction in a late fusion/stacking fashion. Experiments conducted on the SAVEE, eNTERFACE'05, and RML databases show significant performance improvements of the proposed system over current alternatives, establishing the state of the art on all three databases.
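The late fusion/stacking step described above can be sketched as follows: each base classifier (audio, landmark geometry, key-frame CNN) emits a vector of per-class confidences, and these vectors are concatenated into a single meta-feature vector on which a final classifier is trained. This is a minimal illustrative sketch, not the paper's implementation; the function name, the number of classes, and the toy confidence values are all assumptions.

```python
import numpy as np

# Hypothetical per-sample confidence outputs (posteriors over 6 basic
# emotions) from three base classifiers: audio features, landmark
# geometry, and the key-frame CNN. Shapes and values are illustrative.
def stack_confidences(audio_conf, geom_conf, cnn_conf):
    """Concatenate per-modality confidence vectors into one
    meta-feature vector per sample (the stacked feature space)."""
    return np.concatenate([audio_conf, geom_conf, cnn_conf], axis=-1)

# Toy example: 2 samples, 6-class confidences per modality.
audio = np.array([[0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
                  [0.10, 0.60, 0.10, 0.10, 0.05, 0.05]])
geom  = np.array([[0.60, 0.20, 0.05, 0.05, 0.05, 0.05],
                  [0.20, 0.50, 0.10, 0.10, 0.05, 0.05]])
cnn   = np.array([[0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
                  [0.10, 0.70, 0.05, 0.05, 0.05, 0.05]])

meta_features = stack_confidences(audio, geom, cnn)
print(meta_features.shape)  # (2, 18): 3 modalities x 6 classes per sample
```

A second-level classifier (e.g., an SVM or random forest) would then be trained on `meta_features` to predict the final emotion label.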
