Multi-modal Emotion Recognition with Temporal-Band Attention Based on LSTM-RNN

Emotion recognition is a key problem in Human-Computer Interaction (HCI). In this paper, we address multi-modal emotion recognition based on untrimmed visual signals and EEG signals. We propose a model with two attention mechanisms built on a multi-layer Long Short-Term Memory recurrent neural network (LSTM-RNN), combining temporal attention and band attention. At each time step, the LSTM-RNN takes a video slice and an EEG slice as inputs and generates representations of the two signals, which are fed into a multi-modal fusion unit. Based on the fused representation, the network predicts the emotion label and the next time slice to analyze. Within this process, the band attention applies different levels of attention to different frequency bands of the EEG signal, while the temporal attention determines where to look next, suppressing redundant information. Experiments on the MAHNOB-HCI database demonstrate encouraging results: the proposed method achieves higher accuracy and improves computational efficiency.
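The following is a minimal PyTorch sketch of one recurrent step of such a model, under assumed dimensions and layer names (the abstract does not specify the exact architecture): a band-attention weighting over per-band EEG features, a simple concatenation-based fusion with the video representation, an LSTM state update, and heads that predict the emotion label and the position of the next slice to analyze. All names (`BandTemporalAttentionNet`, `band_score`, `next_loc`, etc.) and sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one time step of a band/temporal attention LSTM model.
import torch
import torch.nn as nn

class BandTemporalAttentionNet(nn.Module):
    def __init__(self, n_bands=5, eeg_dim=32, video_dim=128, hidden=256, n_classes=3):
        super().__init__()
        self.band_score = nn.Linear(hidden + eeg_dim, 1)    # band attention scorer
        self.fuse = nn.Linear(eeg_dim + video_dim, hidden)  # multi-modal fusion
        self.lstm = nn.LSTMCell(hidden, hidden)             # recurrent core
        self.cls = nn.Linear(hidden, n_classes)             # emotion label head
        self.next_loc = nn.Linear(hidden, 1)                # temporal attention: next slice position

    def step(self, eeg_bands, video_feat, state):
        # eeg_bands: (batch, n_bands, eeg_dim) per-band EEG features of the current slice
        # video_feat: (batch, video_dim) visual features of the current slice
        h, c = state
        # Band attention: weight each frequency band conditioned on the hidden state.
        h_rep = h.unsqueeze(1).expand(-1, eeg_bands.size(1), -1)
        scores = self.band_score(torch.cat([h_rep, eeg_bands], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)                    # (batch, n_bands)
        eeg = (alpha.unsqueeze(-1) * eeg_bands).sum(dim=1)      # weighted EEG representation
        # Fuse the two modalities and update the LSTM state.
        fused = torch.tanh(self.fuse(torch.cat([eeg, video_feat], dim=-1)))
        h, c = self.lstm(fused, (h, c))
        # Predict the emotion label and the (normalized) location of the next slice.
        logits = self.cls(h)
        next_pos = torch.sigmoid(self.next_loc(h))
        return logits, next_pos, (h, c)
```

In use, the predicted `next_pos` would index into the untrimmed recording to select the following video/EEG slice, so the model only processes a subset of time steps rather than the full sequence.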
