Cross-culture Multimodal Emotion Recognition with Adversarial Learning

With the development of globalization, automatic emotion recognition faces a new challenge in the multi-culture scenario: generalizing across different cultures. Previous works mainly rely on multi-culture datasets, which are expensive to collect, to address the cross-culture discrepancy. In this paper, we propose an adversarial learning framework to alleviate the influence of culture on multimodal emotion recognition. We treat emotion recognition and culture recognition as two adversarial tasks: the emotion feature embedding is trained to improve emotion recognition while confusing culture recognition, so that the embedding becomes more emotion-salient and culture-invariant for cross-culture emotion recognition. Our approach is applicable to both mono-culture and multi-culture emotion datasets. Extensive experiments demonstrate that the proposed method significantly outperforms previous baselines in both cross-culture and multi-culture evaluations.
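
Below is a minimal sketch of one common way to realize this kind of adversarial training, using a gradient reversal layer so that the culture classifier's gradient discourages culture-discriminative information in the shared embedding. The module names, feature sizes, numbers of classes, and loss weighting are illustrative assumptions and not the authors' exact architecture.

```python
# Hypothetical sketch of adversarial emotion/culture training with a
# gradient reversal layer; all sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AdversarialEmotionModel(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_emotions=7, n_cultures=2, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        # Shared encoder over fused multimodal features (placeholder MLP).
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Emotion head: trained to predict emotion from the shared embedding.
        self.emotion_head = nn.Linear(hidden, n_emotions)
        # Culture head: trained to predict culture, but its gradient is reversed
        # before reaching the encoder, pushing the embedding to be culture-invariant.
        self.culture_head = nn.Linear(hidden, n_cultures)

    def forward(self, x):
        z = self.encoder(x)
        emotion_logits = self.emotion_head(z)
        culture_logits = self.culture_head(GradReverse.apply(z, self.lambd))
        return emotion_logits, culture_logits


# One joint training step: minimize the emotion loss while the reversed gradient
# from the culture loss removes culture-specific cues from the embedding.
model = AdversarialEmotionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

features = torch.randn(8, 512)          # dummy fused audio/visual/text features
emotion_y = torch.randint(0, 7, (8,))   # dummy emotion labels
culture_y = torch.randint(0, 2, (8,))   # dummy culture labels

emo_logits, cul_logits = model(features)
loss = ce(emo_logits, emotion_y) + ce(cul_logits, culture_y)
opt.zero_grad()
loss.backward()
opt.step()
```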
