Paper Information - Tensor Fusion Network for Multimodal Sentiment Analysis

Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
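
The abstract does not spell out how the two kinds of dynamics are combined. As an illustration only, the sketch below assumes the fusion step is an outer product of three per-modality embeddings, each extended with a constant 1, so that the resulting tensor holds unimodal terms (intra-modality) alongside bimodal and trimodal products (inter-modality). The function names, embedding sizes, and input values are hypothetical and are not taken from this page.

```python
# Minimal sketch of an outer-product ("tensor") fusion of three modality embeddings.
# Assumption: each embedding is extended with a constant 1 before the product,
# so lower-order (unimodal and bimodal) terms survive in the fused tensor.

def with_constant(vector):
    """Prepend a constant 1.0 to an embedding vector."""
    return [1.0] + list(vector)


def tensor_fusion(z_language, z_visual, z_acoustic):
    """Return the 3-way outer product of the extended embeddings as nested lists."""
    zl = with_constant(z_language)
    zv = with_constant(z_visual)
    za = with_constant(z_acoustic)
    return [[[l * v * a for a in za] for v in zv] for l in zl]


# Hypothetical toy embeddings (language: 2-d, visual: 1-d, acoustic: 2-d).
z_l = [0.5, -1.0]
z_v = [2.0]
z_a = [3.0, 0.25]

fused = tensor_fusion(z_l, z_v, z_a)

# Index 0 along an axis selects that modality's constant 1, so:
unimodal_language = fused[1][0][0]   # z_l[0] alone
bimodal_lang_vis = fused[1][1][0]    # z_l[0] * z_v[0]
trimodal = fused[1][1][1]            # z_l[0] * z_v[0] * z_a[0]
print(len(fused), len(fused[0]), len(fused[0][0]))
print(unimodal_language, bimodal_lang_vis, trimodal)
```

Under this assumption the fused tensor has (2+1) x (1+1) x (2+1) entries for the toy inputs; in a trained model it would be flattened and passed to a downstream predictor, which this sketch omits.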
