Deep Spatio-Temporal Features for Multimodal Emotion Recognition

Automatic emotion recognition has attracted great interest, and numerous solutions have been proposed, most of which focus on either facial expression or acoustic information alone. While more recent research has considered multimodal approaches, the individual modalities are often combined only by simple fusion at the feature and/or decision level. In this paper, we introduce a novel approach that uses 3-dimensional convolutional neural networks (C3Ds) to model spatio-temporal information, cascaded with multimodal deep-belief networks (DBNs) that represent the audio and video streams. Experiments conducted on the eNTERFACE multimodal emotion database demonstrate that this approach improves multimodal emotion recognition performance and significantly outperforms recent state-of-the-art proposals.
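
To make the notion of spatio-temporal modelling concrete, the following is a minimal sketch of a single 3D convolution over a short video clip, written in plain NumPy. It is an illustration only and not the network used in this paper: the clip size (16 frames of 32x32 pixels), the 3x3x3 kernel, the random inputs, and the function name `conv3d_valid` are all assumptions chosen for the example. The point it shows is that each output value depends on a small window spanning several consecutive frames as well as neighbouring pixels, so motion and appearance are captured jointly.

```python
import numpy as np


def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution (implemented as cross-correlation,
    which is the convention in most deep learning libraries).

    clip:   array of shape (T, H, W)   -> frames, height, width
    kernel: array of shape (kt, kh, kw)
    returns array of shape (T - kt + 1, H - kh + 1, W - kw + 1)
    """
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The window covers kt frames and a kh x kw spatial patch.
                window = clip[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(window * kernel)
    return out


# Hypothetical inputs, chosen only for illustration.
rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 32, 32))   # 16-frame grayscale clip
kernel = rng.standard_normal((3, 3, 3))    # one 3x3x3 spatio-temporal filter

features = conv3d_valid(clip, kernel)
print(features.shape)  # (14, 30, 30)

# A purely temporal kernel (next frame minus current frame) gives zero
# response on a static clip, i.e. it reacts only to change over time.
static_clip = np.tile(rng.standard_normal((32, 32)), (16, 1, 1))
temporal_diff = np.array([-1.0, 1.0]).reshape(2, 1, 1)
motion = conv3d_valid(static_clip, temporal_diff)
```

In a trained network the kernel weights are learned rather than random, and many such filters are stacked in layers; a 2D convolution applied frame by frame would correspond to the special case `kt = 1`, which discards the temporal context.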
