Fusion of classifier predictions for audio-visual emotion recognition

This paper presents a novel multimodal emotion recognition system based on the analysis of audio and visual cues. MFCC-based features are extracted from the audio channel, and facial landmark geometric relations are computed from the visual data. Both feature sets are learnt separately using state-of-the-art classifiers. In addition, we summarise each emotion video into a reduced set of key-frames, which a Convolutional Neural Network learns in order to visually discriminate between emotions. Finally, the confidence outputs of all classifiers from all modalities are used to define a new feature space, which is learnt for the final emotion prediction in a late fusion/stacking fashion. Experiments conducted on the eNTERFACE'05 database show significant performance improvements of the proposed system over state-of-the-art approaches.
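To make the stacking step concrete, the sketch below builds the new feature space from per-modality confidence outputs and trains a meta-classifier on it. This is a minimal illustration under stated assumptions, not the paper's exact configuration: the three base classifiers, the RBF-SVM meta-learner, and all names are illustrative placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def stack_confidences(prob_audio, prob_geometry, prob_cnn):
    # Concatenate the (n_samples, n_classes) confidence matrices of the
    # three base classifiers into one stacked meta-feature matrix.
    return np.hstack([prob_audio, prob_geometry, prob_cnn])

# Placeholder confidence outputs for n samples over the 6 basic emotions;
# in practice these would come from the audio (MFCC), facial-geometry,
# and key-frame CNN classifiers.
n, n_classes = 100, 6
rng = np.random.default_rng(0)
prob_audio, prob_geometry, prob_cnn = (
    rng.dirichlet(np.ones(n_classes), size=n) for _ in range(3)
)
y = rng.integers(0, n_classes, size=n)  # placeholder emotion labels

# The stacked confidences define the new feature space; a meta-classifier
# learns it for the final emotion prediction (late fusion / stacking).
X_meta = stack_confidences(prob_audio, prob_geometry, prob_cnn)
meta_clf = SVC(kernel="rbf").fit(X_meta, y)
final_pred = meta_clf.predict(X_meta)
```

Note that in standard stacking the meta-classifier is trained on confidences produced over held-out folds, so that it does not overfit to outputs the base classifiers generated on their own training data.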
