Sentiment Analysis using Imperfect Views from Spoken Language and Acoustic Modalities

Multimodal sentiment classification in practical applications may have to rely on erroneous and imperfect views, namely (a) language transcriptions produced by an automatic speech recognizer and (b) under-performing acoustic views. This work focuses on improving the representations of these imperfect views through deep canonical correlation analysis (DCCA) with the representations of the better-performing manual transcription view. Enhanced representations of the imperfect views can be obtained even in the absence of the perfect view, yielding improved performance at test time. Evaluations on the CMU-MOSI and CMU-MOSEI datasets demonstrate the effectiveness of the proposed approach.
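To make the alignment step concrete, below is a minimal sketch of a DCCA-style correlation objective that could be used to pull the imperfect-view representation toward the manual-transcription representation during training. This is an illustrative PyTorch implementation under assumed settings; the function name cca_loss, the regularization constant eps, and the batch-level covariance estimation are choices made for this sketch and are not taken from the paper.

```python
import torch

def cca_loss(H1, H2, out_dim, eps=1e-4):
    """Negative total correlation between two projected views (DCCA objective).

    H1: (N, d1) batch of representations from the imperfect view (e.g. ASR/acoustic).
    H2: (N, d2) batch of representations from the manual transcription view.
    out_dim: number of canonical correlation components to sum.
    """
    N = H1.size(0)
    # Center each view within the batch.
    H1 = H1 - H1.mean(dim=0, keepdim=True)
    H2 = H2 - H2.mean(dim=0, keepdim=True)

    # Regularized covariance and cross-covariance estimates.
    S12 = H1.t() @ H2 / (N - 1)
    S11 = H1.t() @ H1 / (N - 1) + eps * torch.eye(H1.size(1), device=H1.device)
    S22 = H2.t() @ H2 / (N - 1) + eps * torch.eye(H2.size(1), device=H2.device)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is symmetric PSD).
        e, V = torch.linalg.eigh(S)
        return V @ torch.diag(e.clamp_min(eps).rsqrt()) @ V.t()

    # Whitened cross-correlation matrix; its singular values are the canonical correlations.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    corr = torch.linalg.svdvals(T)[:out_dim].sum()
    return -corr  # negate so that minimizing the loss maximizes correlation

# Hypothetical usage: encoders for the two views are assumed, not defined in the paper excerpt.
# loss = cca_loss(asr_encoder(asr_features), text_encoder(manual_transcript), out_dim=32)
```

At test time only the encoder of the imperfect view is needed, which matches the abstract's claim that enhanced representations are available even when the manual transcription is absent.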
