Multi-modal Correlated Network for emotion recognition in speech

Abstract With the growing demand for automatic emotion recognition systems, emotion recognition is becoming increasingly crucial to human–computer interaction (HCI) research. Recently, the performance of automatic emotion recognition has improved continuously thanks to advances in both hardware and deep learning methods. However, because emotion is an abstract concept expressed in many different ways, automatic emotion recognition remains a challenging task. In this paper, we propose a novel Multi-modal Correlated Network for emotion recognition that exploits information from both the audio and visual channels to achieve more robust and accurate detection. In the proposed method, the audio and visual signals are first preprocessed for feature extraction: the audio segments yield Mel-spectrograms, which can be treated as images, and the visual segments yield representative frames. The Mel-spectrograms are then fed to a convolutional neural network (CNN) to extract audio features, while the representative frames are fed to a CNN followed by an LSTM to extract visual features. In particular, we employ a triplet loss to increase inter-class separation, and we propose a novel correlated loss to reduce intra-class variation. Finally, we apply feature fusion to combine the audio and visual features for emotion classification. Experimental results on the AFEW dataset demonstrate that cross-modal correlation information is crucial for automatic emotion recognition and that the proposed method achieves state-of-the-art performance on the classification task.
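The abstract describes the two-stream architecture only at a high level. Below is a minimal PyTorch sketch of such a pipeline, assuming common design choices: a small CNN over the Mel-spectrogram treated as a single-channel image, a per-frame CNN followed by an LSTM for the visual stream, and concatenation-based fusion. The layer sizes, module names, and the seven-class output (the emotion categories used by AFEW) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """CNN over Mel-spectrograms treated as single-channel images."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, mel):                  # mel: (B, 1, n_mels, time)
        return self.fc(self.conv(mel).flatten(1))

class VisualBranch(nn.Module):
    """Per-frame CNN followed by an LSTM over the frame sequence."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, feat_dim, batch_first=True)

    def forward(self, frames):               # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        _, (h, _) = self.lstm(f)
        return h[-1]                          # last hidden state as visual feature

class FusionClassifier(nn.Module):
    """Concatenate audio and visual features, then classify emotions."""
    def __init__(self, feat_dim=128, num_classes=7):
        super().__init__()
        self.audio, self.visual = AudioBranch(feat_dim), VisualBranch(feat_dim)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, mel, frames):
        fa, fv = self.audio(mel), self.visual(frames)
        return self.head(torch.cat([fa, fv], dim=1)), fa, fv
```

Returning the per-modality features `fa` and `fv` alongside the logits makes it straightforward to attach the auxiliary losses described in the abstract during training.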

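The triplet loss and the proposed correlated loss are likewise only named in the abstract. The sketch below shows one plausible way to combine them with the classification objective, assuming the standard margin-based triplet formulation (available as `torch.nn.TripletMarginLoss`) and a correlated loss that rewards agreement between the audio and visual features of the same clip. The exact form of the correlated loss is this paper's contribution and is not reproduced here, so the centered cosine-similarity variant and the weighting coefficients below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def correlated_loss(fa, fv):
    """Assumed stand-in for the paper's correlated loss: penalize low
    correlation between the audio and visual features of the same sample,
    measured as cosine similarity after per-sample mean-centering."""
    fa = fa - fa.mean(dim=1, keepdim=True)
    fv = fv - fv.mean(dim=1, keepdim=True)
    return (1 - F.cosine_similarity(fa, fv, dim=1)).mean()

triplet = torch.nn.TripletMarginLoss(margin=1.0)

def total_loss(logits, labels, fa, fv, anchor, positive, negative,
               lambda_trip=0.5, lambda_corr=0.5):  # weights are illustrative
    """Cross-entropy for classification, triplet loss for inter-class
    separation, correlated loss for cross-modal intra-class agreement."""
    return (F.cross_entropy(logits, labels)
            + lambda_trip * triplet(anchor, positive, negative)
            + lambda_corr * correlated_loss(fa, fv))
```

In this reading, the triplet terms (`anchor`, `positive`, `negative`) would be feature embeddings mined from the batch by emotion label, while the correlated term ties the two modality streams of each sample together.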