Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis

Human language is expressed through multiple sources of information known as modalities, including tone of voice, facial gestures, and spoken words. Recent multimodal learning models achieve strong performance on human-centric tasks such as sentiment analysis and emotion recognition, but they are often black boxes with very limited interpretability. In this paper we propose Multimodal Routing, which dynamically adjusts the weights between input modalities and output representations differently for each input sample. Multimodal Routing can identify the relative importance of both individual modalities and cross-modality features. Moreover, the weight assignment produced by routing allows us to interpret modality-prediction relationships not only globally (i.e., general trends over the whole dataset) but also locally for each single input sample, while keeping performance competitive with state-of-the-art methods.
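To make the idea concrete, the sketch below shows one way a per-sample routing layer between modality features and output ("concept") representations could look. This is a minimal illustration assuming a dynamic-routing-style agreement update, not the authors' implementation; all names (ModalityRouter, n_concepts, n_iters) and dimensions are placeholders.

```python
# Minimal sketch: route unimodal / cross-modal features to output concepts
# with per-sample routing weights (assumed, illustrative implementation).
import torch
import torch.nn.functional as F


class ModalityRouter(torch.nn.Module):
    def __init__(self, in_dims, n_concepts, concept_dim, n_iters=3):
        super().__init__()
        # One linear projection per input feature (e.g., language, audio,
        # or a cross-modality feature), producing a "vote" per concept.
        self.projections = torch.nn.ModuleList(
            [torch.nn.Linear(d, n_concepts * concept_dim) for d in in_dims]
        )
        self.n_concepts = n_concepts
        self.concept_dim = concept_dim
        self.n_iters = n_iters

    def forward(self, features):
        # features: list of tensors, each of shape (batch, in_dims[i])
        batch = features[0].shape[0]
        votes = torch.stack(
            [p(f).view(batch, self.n_concepts, self.concept_dim)
             for p, f in zip(self.projections, features)],
            dim=1,
        )  # (batch, n_inputs, n_concepts, concept_dim)

        # Routing logits start uniform and are refined by agreement between
        # each vote and the current concept representation, per sample.
        logits = torch.zeros(batch, votes.shape[1], self.n_concepts,
                             device=votes.device)
        for _ in range(self.n_iters):
            weights = F.softmax(logits, dim=-1)  # per-sample routing weights
            concepts = (weights.unsqueeze(-1) * votes).sum(dim=1)
            logits = logits + (votes * concepts.unsqueeze(1)).sum(dim=-1)
        # The returned weights expose how much each input feature
        # contributed to each output concept for this specific sample.
        return concepts, weights


# Example usage with arbitrary feature dimensions for language, audio,
# and a language-audio cross feature.
router = ModalityRouter(in_dims=[300, 74, 128], n_concepts=2, concept_dim=32)
feats = [torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 128)]
concepts, routing_weights = router(feats)
print(routing_weights.shape)  # (4, 3, 2): per sample, per input, per concept
```

Inspecting `routing_weights` for a single sample gives a local explanation of which modality (or cross-modality feature) drove the prediction, while averaging them over the dataset gives the global trends described above.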
