Multimodal fusion based on information gain for emotion recognition in the wild

In this paper we present a novel approach to multimodal emotion recognition on the challenging AFEW'16 dataset, composed of video clips labeled with the six basic emotions plus the neutral state. After a preprocessing stage, we employ several feature extraction techniques (CNN-based, DSIFT on the face and facial ROIs, geometric, and audio-based) and encode the frame-based features using Fisher vector representations. Next, we leverage the properties of each modality through different fusion schemes. In addition to early-level and decision-level fusion, we propose a hierarchical decision-level method based on information gain principles, and we optimize its parameters using genetic algorithms. The experimental results demonstrate the suitability of our method: we obtain 53.06% validation accuracy, surpassing the 38.81% baseline by more than 14 percentage points on this challenging in-the-wild dataset.
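The abstract only names the information-gain-based fusion; the paper's implementation is not published here. The following is a minimal sketch, under stated assumptions, of one way an information-gain-weighted decision-level fusion could be wired up: each modality's information gain (the mutual information between true labels and that modality's validation predictions) is used to weight its class posteriors before summing. The functions `information_gain` and `fuse`, and the flat rather than hierarchical weighting, are illustrative assumptions of this sketch, not the authors' code.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (zero bins ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(y_true, y_pred, n_classes):
    """Estimate IG(Y; Y_hat) = H(Y) - H(Y | Y_hat) from validation
    predictions of a single modality, via the joint label/prediction
    distribution."""
    joint = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        joint[t, p] += 1
    joint /= joint.sum()
    h_y = entropy(joint.sum(axis=1))      # marginal entropy of true labels
    h_y_given_pred = 0.0
    for j in range(n_classes):
        p_pred = joint[:, j].sum()        # P(Y_hat = j)
        if p_pred > 0:
            h_y_given_pred += p_pred * entropy(joint[:, j] / p_pred)
    return h_y - h_y_given_pred

def fuse(proba_per_modality, gains):
    """Decision-level fusion: scale each modality's (n_samples, n_classes)
    posterior matrix by its normalized information gain, sum, and take the
    argmax class."""
    w = np.asarray(gains, dtype=float)
    w /= w.sum()
    fused = sum(wi * p for wi, p in zip(w, proba_per_modality))
    return fused.argmax(axis=1)
```

On top of such a scheme, the weight vector could be refined with a genetic algorithm using validation accuracy as the fitness function, which is the role the abstract assigns to GA-based optimization; the authors' actual method additionally organizes the fusion hierarchically rather than as a single flat weighted sum.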
