Multimodal shared features learning for emotion recognition by enhanced sparse local discriminative canonical correlation analysis

Multimodal emotion recognition is a challenging research topic that has recently begun to attract the attention of the research community. To better recognize the emotions of video users, research on multimodal emotion recognition based on audio and video is essential. Multimodal emotion recognition performance depends heavily on finding a good shared feature representation. A good shared representation must satisfy two requirements: (1) it preserves the characteristics of each modality, and (2) it balances the contributions of the different modalities so that the final decision is optimal. In light of this, we propose a novel Enhanced Sparse Local Discriminative Canonical Correlation Analysis (En-SLDCCA) approach to learn the multimodal shared feature representation. Learning the shared feature representation involves two stages. In the first stage, we pretrain a Sparse Auto-Encoder on unimodal video (or audio), obtaining the hidden feature representations of video and audio separately. In the second stage, we compute the correlation coefficients between the video and audio representations using our En-SLDCCA approach, and then form the shared feature representation by fusing the video and audio features with these coefficients. We evaluate the performance of our method on the challenging multimodal eNTERFACE'05 database. Experimental results reveal that our method is superior to unimodal video (or audio) and significantly improves multimodal emotion recognition performance compared with the current state of the art.
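The abstract outlines the two-stage pipeline without implementation detail, so the Python sketch below is only a minimal illustration under stated assumptions: a plain NumPy autoencoder with an L1 penalty on its hidden activations stands in for the Sparse Auto-Encoder of stage one, and scikit-learn's standard CCA stands in for En-SLDCCA in stage two (the abstract does not specify its sparse, local, and discriminative terms). The fusion-by-concatenation step, all function names, and every hyperparameter are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.cross_decomposition import CCA


def train_sparse_autoencoder(X, n_hidden=64, l1=1e-3, lr=0.01, epochs=200, seed=0):
    """Stage 1 (sketch): a single-hidden-layer autoencoder with an L1
    sparsity penalty on the hidden activations, trained by batch
    gradient descent. Returns an encoder function for one modality."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)        # hidden (sparse) representation
        err = (H @ W2 + b2 - X) / n     # reconstruction error, averaged
        # Gradients of (1/2n)||X_hat - X||^2 + (l1/n) * sum|H|
        dW2, db2 = H.T @ err, err.sum(axis=0)
        dpre = (err @ W2.T + (l1 / n) * np.sign(H)) * (1.0 - H ** 2)
        dW1, db1 = X.T @ dpre, dpre.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return lambda Z: np.tanh(Z @ W1 + b1)


def fuse_by_cca(H_video, H_audio, n_components=16):
    """Stage 2 (sketch): correlate the two hidden representations and
    fuse them. Standard CCA substitutes for En-SLDCCA; concatenating
    the correlated projections is one plausible reading of 'fusion'."""
    cca = CCA(n_components=n_components)
    Zv, Za = cca.fit_transform(H_video, H_audio)
    return np.hstack([Zv, Za])          # shared feature representation


# Hypothetical usage with random stand-ins for real features
# (e.g. facial-expression descriptors and acoustic descriptors).
rng = np.random.default_rng(1)
X_video = rng.normal(size=(200, 100))
X_audio = rng.normal(size=(200, 60))
encode_video = train_sparse_autoencoder(X_video)
encode_audio = train_sparse_autoencoder(X_audio)
shared = fuse_by_cca(encode_video(X_video), encode_audio(X_audio))
# `shared` would then be fed to an emotion classifier (e.g. an SVM).
```

The design choice here mirrors the abstract's logic: each modality first learns its own compact representation (requirement 1), and the CCA step then aligns the two spaces so neither modality dominates the fused features (requirement 2).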
