Sentiment Analysis using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Sentiment analysis research has developed rapidly over the last decade and has attracted widespread attention from both academia and industry, yet most of this work is based on text alone. However, information in the real world usually arrives in multiple modalities. In this paper, we consider the task of multimodal sentiment analysis using the audio and text modalities, and we propose a novel fusion strategy combining multi-feature fusion and multi-modality fusion to improve the accuracy of audio-text sentiment analysis. We call this the Deep Feature Fusion-Audio and Text Modal Fusion (DFF-ATMF) model; the features it learns are complementary to each other and robust. Experiments on the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show that our proposed model achieves very competitive results. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy also generalizes extremely well to multimodal emotion recognition.
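
To make the multi-modality fusion idea concrete, the following is a minimal, hedged sketch in PyTorch of an audio-text fusion classifier: two unimodal encoder branches whose pooled representations are concatenated and fed to a sentiment head. The branch architectures, feature dimensions, and concatenation-based fusion operator here are illustrative placeholders, not the exact DFF-ATMF configuration described in the paper.

```python
# Illustrative audio-text fusion sketch (placeholder architecture, not the
# paper's exact DFF-ATMF model).
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Encodes an utterance-level acoustic feature sequence (e.g. MFCCs)."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        out, _ = self.rnn(x)
        return out.mean(dim=1)       # (batch, 2 * hidden), pooled over time

class TextBranch(nn.Module):
    """Encodes a token-embedding sequence (e.g. pre-trained word vectors)."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, tokens, emb_dim)
        out, _ = self.rnn(x)
        return out.mean(dim=1)       # (batch, 2 * hidden)

class AudioTextFusion(nn.Module):
    """Concatenation-based multimodal fusion followed by a sentiment head."""
    def __init__(self, hidden=128, num_classes=2):
        super().__init__()
        self.audio = AudioBranch(hidden=hidden)
        self.text = TextBranch(hidden=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_feats, text_embs):
        fused = torch.cat([self.audio(audio_feats), self.text(text_embs)], dim=-1)
        return self.classifier(fused)

# Example forward pass with random tensors standing in for real features.
model = AudioTextFusion()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 30, 300))
print(logits.shape)                  # torch.Size([8, 2])
```

In practice the two branch outputs would come from separately trained multi-feature encoders, and the fusion operator could be attention-based rather than simple concatenation; this sketch only fixes the overall two-branch, late-fusion shape of the approach.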
