Implicit Fusion by Joint Audiovisual Training for Emotion Recognition in Mono Modality

Despite significant advances in emotion recognition from individual modalities, previous studies have failed to take advantage of complementary modalities when training models for mono-modal scenarios. In this work, we propose a novel joint training model that implicitly fuses audio and visual information during training for either speech or facial emotion recognition. Specifically, the model consists of one modality-specific network per modality and one shared network that maps both audio and visual cues to final predictions. During training, we additionally take into account the loss from an auxiliary modality alongside that of the main modality. To evaluate the effectiveness of the implicit fusion model, we conduct extensive experiments on mono-modal emotion classification and regression, and find that the implicit fusion models outperform standard mono-modal training.
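
To make the described architecture concrete, the following is a minimal sketch of such a joint training step in PyTorch; the network sizes, feature dimensions, the auxiliary loss weight alpha, and all variable names are illustrative assumptions, not the configuration used in the paper.

```python
# Hypothetical sketch: two modality-specific encoders feed one shared head,
# and the auxiliary modality's loss is added to the main modality's loss.
import torch
import torch.nn as nn

class ImplicitFusionModel(nn.Module):
    def __init__(self, audio_dim=88, visual_dim=512, hidden_dim=128, num_classes=4):
        super().__init__()
        # One modality-specific network per modality (dimensions are assumed).
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # One shared network mapping either modality's representation to predictions.
        self.shared_net = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio=None, visual=None):
        logits_a = self.shared_net(self.audio_net(audio)) if audio is not None else None
        logits_v = self.shared_net(self.visual_net(visual)) if visual is not None else None
        return logits_a, logits_v

model = ImplicitFusionModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 0.5  # assumed weighting of the auxiliary-modality loss

# Dummy batch standing in for acoustic and facial features (assumed shapes).
audio_batch = torch.randn(8, 88)
visual_batch = torch.randn(8, 512)
labels = torch.randint(0, 4, (8,))

# Joint training step with audio as the main modality and video as auxiliary.
logits_a, logits_v = model(audio=audio_batch, visual=visual_batch)
loss = criterion(logits_a, labels) + alpha * criterion(logits_v, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At test time only the main modality is needed; the auxiliary one is dropped.
test_logits, _ = model(audio=torch.randn(1, 88))
```

Because both branches share the output network, gradients from the auxiliary modality shape the shared parameters during training, while inference remains strictly mono-modal.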
