Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition

Emotion recognition is challenging because of the emotional gap between emotions and audio–visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap with a hybrid deep model, which first produces audio–visual segment features with a Convolutional Neural Network (CNN) and a 3D-CNN, then fuses these segment features in a Deep Belief Network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined in a fusion network built with a DBN model, which is trained to jointly learn a discriminative audio–visual segment feature representation. The segment features learned by the DBN are average-pooled to form a fixed-length global video feature, and a linear Support Vector Machine is then used for video emotion classification. Experimental results on three public audio–visual emotional databases (the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database) demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN, and DBN for audio–visual emotion recognition.
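The average-pooling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `average_pool_segments`, the 3-dimensional vectors, and the toy values are all hypothetical, and the segment features are assumed to already be the outputs of the fusion network. The point is only that videos with different numbers of segments map to feature vectors of the same length, which a linear classifier can then consume.

```python
from typing import List, Sequence


def average_pool_segments(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment feature vectors into one fixed-length video feature."""
    if not segment_features:
        raise ValueError("a video needs at least one segment")
    dim = len(segment_features[0])
    if any(len(seg) != dim for seg in segment_features):
        raise ValueError("all segment features must share one dimensionality")
    n = len(segment_features)
    # Element-wise mean over the segment axis.
    return [sum(seg[i] for seg in segment_features) / n for i in range(dim)]


# Hypothetical toy data: two videos with different numbers of segments,
# each segment described by a 3-dimensional feature vector.
short_video = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_video = [[0.0, 0.0, 6.0], [2.0, 2.0, 6.0], [4.0, 4.0, 6.0]]

short_feature = average_pool_segments(short_video)
long_feature = average_pool_segments(long_video)
```

Both pooled vectors have length 3 regardless of segment count, so they could be stacked into a single matrix and passed to a linear classifier.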
