Multi-attention Recurrent Network for Human Communication Comprehension

Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality), and changes in tone (acoustic modality) to convey our intentions. Humans process and understand face-to-face communication with ease; for Artificial Intelligence (AI), however, comprehending this form of communication remains a significant challenge. AI must understand each modality as well as the interactions between them that shape the communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition, and emotion recognition. MARN achieves state-of-the-art performance on all of these datasets.
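To make the two named components more concrete, below is a minimal sketch of how an MAB and a set of per-modality LSTHM cells could be wired together, assuming a PyTorch-style implementation. The class names, attention parameterization, fusion layer, and toy dimensions are illustrative assumptions inferred from the abstract, not the paper's exact equations.

```python
import torch
import torch.nn as nn


class MultiAttentionBlock(nn.Module):
    """Applies K attention distributions over the concatenated per-modality
    hidden states and fuses the attended features into a cross-modal code z.
    (Sketch only; the paper's exact formulation may differ.)"""

    def __init__(self, total_hidden: int, num_attentions: int, code_dim: int):
        super().__init__()
        self.num_attentions = num_attentions
        self.attention = nn.Linear(total_hidden, num_attentions * total_hidden)
        self.fuse = nn.Linear(num_attentions * total_hidden, code_dim)

    def forward(self, h_cat: torch.Tensor) -> torch.Tensor:
        # h_cat: (batch, total_hidden), concatenation of all modality states.
        scores = self.attention(h_cat).view(-1, self.num_attentions, h_cat.size(-1))
        weights = torch.softmax(scores, dim=-1)             # K attention maps
        attended = weights * h_cat.unsqueeze(1)             # re-weighted states
        return torch.tanh(self.fuse(attended.flatten(1)))   # cross-modal code z


class LSTHMCell(nn.Module):
    """An LSTM-style cell whose memory is updated from the modality's own
    input together with the cross-modal code z produced by the MAB."""

    def __init__(self, input_dim: int, hidden_dim: int, code_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + code_dim, hidden_dim)

    def forward(self, x, z, state):
        return self.cell(torch.cat([x, z], dim=-1), state)


if __name__ == "__main__":
    # Toy usage with three modalities of arbitrary feature sizes.
    dims = {"language": 300, "vision": 35, "acoustic": 74}
    hidden, code = 64, 32
    cells = {m: LSTHMCell(d, hidden, code) for m, d in dims.items()}
    mab = MultiAttentionBlock(total_hidden=hidden * len(dims),
                              num_attentions=4, code_dim=code)

    batch = 2
    states = {m: (torch.zeros(batch, hidden), torch.zeros(batch, hidden)) for m in dims}
    z = torch.zeros(batch, code)
    for _ in range(5):  # unroll over a short sequence of time steps
        inputs = {m: torch.randn(batch, d) for m, d in dims.items()}
        states = {m: cells[m](inputs[m], z, states[m]) for m in dims}
        h_cat = torch.cat([states[m][0] for m in dims], dim=-1)
        z = mab(h_cat)  # cross-modal code fed back into every LSTHM next step
```

At each time step, every modality runs its own LSTHM cell on its input together with the shared cross-modal code from the previous step; the resulting hidden states are concatenated and passed through the MAB to produce the code used at the next step, which is the recurrence-with-attention pattern the abstract describes.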
