CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network

Multimodal sentiment analysis is a challenging research area concerned with fusing multiple heterogeneous modalities. A central difficulty is that some modalities may be missing during multimodal fusion; existing techniques, however, require all modalities as input and are therefore sensitive to missing modalities at prediction time. In this work, the coupled-translation fusion network (CTFN) is proposed to model bi-directional interplay between modalities via coupled learning, ensuring robustness with respect to missing modalities. Specifically, a cyclic consistency constraint is introduced to improve translation performance, allowing the Transformer decoder to be discarded so that only the encoder is retained, which yields a much lighter model. Owing to coupled learning, CTFN can compute bi-directional cross-modality correlations in parallel. Based on CTFN, a hierarchical architecture is further established to exploit multiple bi-directional translations, producing twice as many multimodal fusion embeddings as traditional translation methods. Moreover, a convolution block is used to further highlight explicit interactions among these translations. For evaluation, CTFN is verified on two multimodal benchmarks with extensive ablation studies. The experiments demonstrate that the proposed framework achieves state-of-the-art or competitive performance, and that CTFN remains robust when modalities are missing.
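To make the coupled-translation idea concrete, the following is a minimal sketch, not the authors' implementation, of one CTFN-style bi-directional translation module: two Transformer encoders translate modality A into modality B and vice versa, and a cyclic consistency loss ties the two directions together while no Transformer decoder is used. All module names, dimensions, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of a coupled bi-directional translation module with a
# cyclic consistency loss, in the spirit of CTFN. Names, dimensions, and
# hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class TranslationEncoder(nn.Module):
    """Transformer encoder (no decoder) mapping one modality sequence to another."""

    def __init__(self, src_dim: int, tgt_dim: int, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.in_proj = nn.Linear(src_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, tgt_dim)

    def forward(self, x):
        h = self.encoder(self.in_proj(x))   # (batch, seq, d_model): fusion embedding
        return self.out_proj(h), h          # translated sequence + hidden embedding


class CoupledTranslation(nn.Module):
    """Couples the A->B and B->A encoders so both directions are learned in parallel."""

    def __init__(self, dim_a: int, dim_b: int):
        super().__init__()
        self.a2b = TranslationEncoder(dim_a, dim_b)
        self.b2a = TranslationEncoder(dim_b, dim_a)

    def forward(self, a, b):
        b_hat, emb_ab = self.a2b(a)          # translate A -> B
        a_hat, emb_ba = self.b2a(b)          # translate B -> A
        # Cyclic consistency: translating back should approximately recover the source.
        a_cyc, _ = self.b2a(b_hat)
        b_cyc, _ = self.a2b(a_hat)
        cyc_loss = nn.functional.l1_loss(a_cyc, a) + nn.functional.l1_loss(b_cyc, b)
        return emb_ab, emb_ba, cyc_loss      # two fusion embeddings per modality pair
```

Under these assumptions, if modality B is missing at prediction time, the embedding `emb_ab` obtained by translating from A alone can stand in for the joint representation, which illustrates why translation-based fusion is robust to missing modalities.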
