Fusion of classifier predictions for audio-visual emotion recognition

This paper presents a novel multimodal emotion recognition system based on the analysis of audio and visual cues. MFCC-based features are extracted from the audio channel, and facial landmark geometric relations are computed from the visual data. Both feature sets are learnt separately using state-of-the-art classifiers. In addition, we summarise each emotion video into a reduced set of key-frames, which a Convolutional Neural Network learns in order to visually discriminate between emotions. Finally, the confidence outputs of all classifiers from all modalities are used to define a new feature space, which is learnt for the final emotion prediction in a late fusion/stacking fashion. Experiments conducted on the eNTERFACE'05 database show significant performance improvements of the proposed system over state-of-the-art approaches.
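To make the stacking step concrete, the sketch below builds the new feature space from per-modality confidence outputs and trains a meta-classifier on it. This is a minimal illustration under stated assumptions, not the paper's exact configuration: the three base classifiers, the RBF-SVM meta-learner, and all names are illustrative placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def stack_confidences(prob_audio, prob_geometry, prob_cnn):
    # Concatenate the (n_samples, n_classes) confidence matrices of the
    # three base classifiers into one stacked meta-feature matrix.
    return np.hstack([prob_audio, prob_geometry, prob_cnn])

# Placeholder confidence outputs for n samples over the 6 basic emotions;
# in practice these would come from the audio (MFCC), facial-geometry,
# and key-frame CNN classifiers.
n, n_classes = 100, 6
rng = np.random.default_rng(0)
prob_audio, prob_geometry, prob_cnn = (
    rng.dirichlet(np.ones(n_classes), size=n) for _ in range(3)
)
y = rng.integers(0, n_classes, size=n)  # placeholder emotion labels

# The stacked confidences define the new feature space; a meta-classifier
# learns it for the final emotion prediction (late fusion / stacking).
X_meta = stack_confidences(prob_audio, prob_geometry, prob_cnn)
meta_clf = SVC(kernel="rbf").fit(X_meta, y)
final_pred = meta_clf.predict(X_meta)
```

Note that in standard stacking the meta-classifier is trained on confidences produced over held-out folds, so that it does not overfit to outputs the base classifiers generated on their own training data.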
