Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

[1]  Aaron C. Courville,et al.  Generative Adversarial Networks , 2022, 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT).

[2]  Songlong Xing,et al.  Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing , 2020, IEEE Transactions on Multimedia.

[3]  Songlong Xing,et al.  Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing , 2019, ACL.

[4]  Ruslan Salakhutdinov,et al.  Multimodal Transformer for Unaligned Multimodal Language Sequences , 2019, ACL.

[5]  Ruslan Salakhutdinov,et al.  Strong and Simple Baselines for Multimodal Utterance Embeddings , 2019, NAACL.

[6]  Huimin Lu,et al.  Deep adversarial metric learning for cross-modal retrieval , 2019, World Wide Web.

[7]  Barnabás Póczos,et al.  Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities , 2018, AAAI.

[8]  Louis-Philippe Morency,et al.  Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors , 2018, AAAI.

[9]  Louis-Philippe Morency,et al.  Multimodal Language Analysis with Recurrent Multistage Fusion , 2018, EMNLP.

[10]  Erik Cambria,et al.  Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph , 2018, ACL.

[11]  Ruslan Salakhutdinov,et al.  Learning Factorized Multimodal Representations , 2018, ICLR.

[12]  Erik Cambria,et al.  Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling , 2018, Knowl. Based Syst..

[13]  Ivan Marsic,et al.  Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment , 2018, ACL.

[14]  Pascale Fung,et al.  Investigating Audio, Video, and Text Fusion Methods for End-to-End Automatic Personality Prediction , 2018, ACL.

[15]  Louis-Philippe Morency,et al.  Efficient Low-rank Multimodal Fusion With Modality-Specific Factors , 2018, ACL.

[16]  Erik Cambria,et al.  Memory Fusion Network for Multi-view Sequential Learning , 2018, AAAI.

[17]  Erik Cambria,et al.  Multi-attention Recurrent Network for Human Communication Comprehension , 2018, AAAI.

[18]  Lan Wang,et al.  Multimodal learning using 3D audio-visual data for audio-visual speech recognition , 2017, 2017 International Conference on Asian Language Processing (IALP).

[19]  Joan Bruna,et al.  Few-Shot Learning with Graph Neural Networks , 2017, ICLR.

[20]  Yuxin Peng,et al.  CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning , 2017, ArXiv.

[21]  Erik Cambria,et al.  A review of affective computing: From unimodal analysis to multimodal fusion , 2017, Inf. Fusion.

[22]  Erik Cambria,et al.  Tensor Fusion Network for Multimodal Sentiment Analysis , 2017, EMNLP.

[23]  Erik Cambria,et al.  Context-Dependent Sentiment Analysis in User-Generated Videos , 2017, ACL.

[24]  Louis-Philippe Morency,et al.  Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Erik Cambria,et al.  Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis , 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).

[26]  Louis-Philippe Morency,et al.  Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages , 2016, IEEE Intelligent Systems.

[27]  Louis-Philippe Morency,et al.  Deep multimodal fusion for persuasiveness prediction , 2016, ICMI.

[28]  Navdeep Jaitly,et al.  Adversarial Autoencoders , 2015, ArXiv.

[29]  Jianmin Wang,et al.  Semantics-preserving hashing for cross-view retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[31]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[32]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  John Kane,et al.  COVAREP — A collaborative voice analysis repository for speech technologies , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Björn W. Schuller,et al.  YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context , 2013, IEEE Intelligent Systems.

[35]  J. Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[36]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[37]  Björn Schuller,et al.  Opensmile: the munich versatile and fast open-source audio feature extractor , 2010, ACM Multimedia.

[38]  Carlos Busso,et al.  IEMOCAP: interactive emotional dyadic motion capture database , 2008, Lang. Resour. Evaluation.

[39]  B. Liu,et al.  A Survey of Opinion Mining and Sentiment Analysis , 2012, Mining Text Data.

[40]  Ming Yang,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 3d Convolutional Neural Networks for Human Action Recognition , 2022 .