[Figure: MFM neural architecture, with (a) the MFM generative network and (b) the MFM inference network.]

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although multiple modalities provide additional valuable information, there are two key challenges to address when learning from multimodal data: (1) models must learn the complex intra-modal and cross-modal interactions required for prediction, and (2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique to each modality and contain the information required for generating data. Experimental results show that our model learns meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors, and it can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.
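To make the factorization concrete, the snippet below is a minimal PyTorch sketch, not the authors' implementation: each modality is encoded into a modality-specific generative factor, all modalities are jointly encoded into a shared discriminative factor, each decoder reconstructs its modality from the concatenation of the two, and a classifier predicts the label from the shared factor alone. The class name `MFMSketch`, all layer sizes, the feature dimensions (300/74/35, suggestive of text/audio/visual features), and the mean-squared losses are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    """Minimal sketch of a factorized multimodal model (illustrative only).

    Each modality x_m -> modality-specific generative factor z_m.
    All modalities jointly -> shared multimodal discriminative factor z_y.
    Each modality is reconstructed from [z_y; z_m]; the label is
    predicted from z_y alone.
    """

    def __init__(self, dims, z_dim=32, y_dim=1):
        super().__init__()
        # One encoder per modality for the modality-specific factors.
        self.spec_enc = nn.ModuleList([nn.Linear(d, z_dim) for d in dims])
        # A single encoder over the concatenated inputs for the shared factor.
        self.shared_enc = nn.Linear(sum(dims), z_dim)
        # One decoder per modality, conditioned on both factors.
        self.decoders = nn.ModuleList([nn.Linear(2 * z_dim, d) for d in dims])
        self.classifier = nn.Linear(z_dim, y_dim)

    def forward(self, xs):
        # Modality-specific generative factors.
        z_spec = [enc(x) for enc, x in zip(self.spec_enc, xs)]
        # Shared multimodal discriminative factor.
        z_y = self.shared_enc(torch.cat(xs, dim=-1))
        # Reconstruct each modality from its own factor plus the shared one.
        recons = [dec(torch.cat([z_y, z], dim=-1))
                  for dec, z in zip(self.decoders, z_spec)]
        return self.classifier(z_y), recons

# Joint generative-discriminative objective with illustrative MSE terms.
model = MFMSketch(dims=[300, 74, 35])  # hypothetical text/audio/visual dims
xs = [torch.randn(8, d) for d in (300, 74, 35)]
y = torch.randn(8, 1)                  # e.g. continuous sentiment intensity
pred, recons = model(xs)
loss = nn.functional.mse_loss(pred, y) \
     + sum(nn.functional.mse_loss(r, x) for r, x in zip(recons, xs))
loss.backward()
```

Because each decoder depends only on its own modality's generative factor and the shared factor, a missing modality at test time can, in principle, be reconstructed from the remaining modalities' factors, which is the intuition behind the robustness claim above.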
