Low Rank Fusion based Transformers for Multimodal Sequences

Our senses work individually, yet in a coordinated fashion, to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals that attend to our latent multimodal emotional intentions, and vice versa, via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion across modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of [7] and [1], we present a transformer-based cross-fusion architecture that avoids over-parameterizing the model: the low-rank fusion represents the latent signal interactions, while the modality-specific attention focuses on relevant parts of each signal. We present two methods for multimodal sentiment and emotion recognition on the CMU-MOSEI [8], CMU-MOSI [4], and IEMOCAP [14] datasets and show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
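To make the fusion idea concrete, below is a minimal PyTorch sketch of low-rank multimodal fusion in the spirit of [1]: the full tensor-fusion weight tensor is approximated by per-modality low-rank factors, so multiplicative cross-modal interactions are captured without materializing the outer product of the inputs. The class name `LowRankFusion`, the rank, and the feature dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Sketch of low-rank multimodal fusion (after [1]).

    Each modality m has a factor of shape (rank, d_m + 1, out_dim);
    the +1 appends a constant 1 to the input so unimodal and lower-order
    terms of the tensor product are preserved.
    """

    def __init__(self, dims, out_dim, rank=4):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(rank, d + 1, out_dim)) for d in dims]
        )
        self.rank_weights = nn.Parameter(0.1 * torch.randn(rank))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, *inputs):
        # inputs[m]: (batch, d_m) summary vector for modality m
        fused = None
        for x, factor in zip(inputs, self.factors):
            x1 = torch.cat([x, x.new_ones(x.size(0), 1)], dim=-1)  # (batch, d_m + 1)
            proj = torch.einsum("bd,rdo->bro", x1, factor)         # (batch, rank, out_dim)
            # Elementwise product across modalities approximates the
            # multiplicative interactions of full tensor fusion.
            fused = proj if fused is None else fused * proj
        # Collapse the rank dimension with learned weights.
        return torch.einsum("r,bro->bo", self.rank_weights, fused) + self.bias
```

A hypothetical usage with feature sizes in the range of common CMU-MOSEI text/audio/visual features:

```python
fuse = LowRankFusion(dims=(300, 74, 35), out_dim=128, rank=4)
h = fuse(torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35))
print(h.shape)  # torch.Size([8, 128])
```

Memory and parameter count grow linearly in the number of modalities and the rank, rather than exponentially in the number of modalities as with an explicit fusion tensor.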

[1] Louis-Philippe Morency, et al. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors, 2018, ACL.

[2] K. Scherer. Emotion as a multicomponent process: A model and some cross-cultural data, 1984.

[3] J. Russell, et al. The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology, 2005, Development and Psychopathology.

[4] Louis-Philippe Morency, et al. Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages, 2016, IEEE Intelligent Systems.

[5] Jianhai Zhang, et al. Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, 2019, NeurIPS.

[6] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[7] Ruslan Salakhutdinov, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences, 2019, ACL.

[8] Erik Cambria, et al. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, 2018, ACL.

[9] P. Ekman. Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage, 2016.

[10] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[11] R. Plutchik. Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice, 2016.

[12] Erik Cambria, et al. Tensor Fusion Network for Multimodal Sentiment Analysis, 2017, EMNLP.

[13] Ruslan Salakhutdinov, et al. Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization, 2019, ACL.

[14] Carlos Busso, et al. IEMOCAP: interactive emotional dyadic motion capture database, 2008, Language Resources and Evaluation.