Deep Multimodal Learning for Emotion Recognition in Spoken Language

In this paper, we present a novel deep multimodal framework for predicting human emotions from sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure that considers spatial information from text, temporal information from audio, and high-level associations derived from low-level handcrafted features. Second, we fuse all features using a three-layer deep neural network to learn the correlations across modalities, and we train the feature-extraction and fusion modules together, allowing optimal global fine-tuning of the entire structure. We evaluated the proposed framework on the IEMOCAP dataset. Our results show promising performance, achieving 60.4% weighted accuracy across five emotion categories.
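To make the described architecture concrete, the following is a minimal PyTorch sketch of a three-branch model of this kind: a CNN over word embeddings for text, an LSTM over frame-level acoustic features for audio, a dense layer over handcrafted descriptors, and a three-layer fusion network trained end to end. All dimensions, layer widths, and the five-class output are illustrative assumptions; the paper's exact hyperparameters are not specified in the abstract.

```python
# A hedged sketch of a hybrid multimodal emotion classifier.
# Feature dimensions and layer sizes are hypothetical, not the paper's.
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300,
                 audio_dim=40, handcrafted_dim=384, num_classes=5):
        super().__init__()
        # Text branch: a 1-D CNN over word embeddings captures
        # spatial (n-gram) patterns in the sentence.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        # Audio branch: an LSTM over frame-level features captures
        # the temporal dynamics of speech.
        self.audio_lstm = nn.LSTM(audio_dim, 128, batch_first=True)
        # Handcrafted branch: a dense layer lifts low-level descriptors
        # (e.g., openSMILE-style statistics) to higher-level associations.
        self.hand_fc = nn.Linear(handcrafted_dim, 128)
        # Three-layer fusion DNN learns cross-modal correlations.
        self.fusion = nn.Sequential(
            nn.Linear(128 * 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, tokens, audio_frames, handcrafted):
        # tokens: (B, T_text)  audio_frames: (B, T_audio, audio_dim)
        t = self.embed(tokens).transpose(1, 2)               # (B, embed_dim, T_text)
        t = torch.relu(self.text_conv(t)).max(dim=2).values  # global max-pool
        _, (a, _) = self.audio_lstm(audio_frames)            # last hidden state
        a = a.squeeze(0)                                     # (B, 128)
        h = torch.relu(self.hand_fc(handcrafted))            # (B, 128)
        return self.fusion(torch.cat([t, a, h], dim=1))      # (B, num_classes)
```

Because all three branches and the fusion network live in one module, a single optimizer over `model.parameters()` updates them jointly, which corresponds to the global fine-tuning of feature extraction and fusion that the abstract describes.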
