Multimodal fusion based on information gain for emotion recognition in the wild

In this paper we present a novel approach to multimodal emotion recognition on the challenging AFEW'16 dataset, composed of video clips labeled with the six basic emotions plus the neutral state. After a preprocessing stage, we employ several feature extraction techniques (CNN-based, DSIFT on the face and facial ROIs, geometric, and audio-based) and encode the frame-based features using Fisher vector representations. Next, we leverage the properties of each modality through different fusion schemes. In addition to early-level and decision-level fusion, we propose a hierarchical decision-level method based on information gain principles, and we optimize its parameters using genetic algorithms. The experimental results demonstrate the suitability of our method: we obtain 53.06% validation accuracy, surpassing the 38.81% baseline by more than 14 percentage points on this challenging in-the-wild dataset.
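The abstract only names the information-gain-based fusion; the paper's implementation is not published here. The following is a minimal sketch, under stated assumptions, of one way an information-gain-weighted decision-level fusion could be wired up: each modality's information gain (the mutual information between true labels and that modality's validation predictions) is used to weight its class posteriors before summing. The functions `information_gain` and `fuse`, and the flat rather than hierarchical weighting, are illustrative assumptions of this sketch, not the authors' code.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (zero bins ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(y_true, y_pred, n_classes):
    """Estimate IG(Y; Y_hat) = H(Y) - H(Y | Y_hat) from validation
    predictions of a single modality, via the joint label/prediction
    distribution."""
    joint = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        joint[t, p] += 1
    joint /= joint.sum()
    h_y = entropy(joint.sum(axis=1))      # marginal entropy of true labels
    h_y_given_pred = 0.0
    for j in range(n_classes):
        p_pred = joint[:, j].sum()        # P(Y_hat = j)
        if p_pred > 0:
            h_y_given_pred += p_pred * entropy(joint[:, j] / p_pred)
    return h_y - h_y_given_pred

def fuse(proba_per_modality, gains):
    """Decision-level fusion: scale each modality's (n_samples, n_classes)
    posterior matrix by its normalized information gain, sum, and take the
    argmax class."""
    w = np.asarray(gains, dtype=float)
    w /= w.sum()
    fused = sum(wi * p for wi, p in zip(w, proba_per_modality))
    return fused.argmax(axis=1)
```

On top of such a scheme, the weight vector could be refined with a genetic algorithm using validation accuracy as the fitness function, which is the role the abstract assigns to GA-based optimization; the authors' actual method additionally organizes the fusion hierarchically rather than as a single flat weighted sum.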
