Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content

The sheer amount of human-centric multimedia content has led to increased research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, named the Temporally Selective Attention Model (TSAM), designed to selectively attend to salient parts of human-centric video sequences. TSAM learns to recognize affective and social states using a new loss function called the speaker-distribution loss. Extensive experiments show that our model achieves state-of-the-art performance on rapport detection and multimodal sentiment analysis. We also show that the speaker-distribution loss generalizes to other computational models, improving the prediction performance of deep averaging networks and Long Short-Term Memory (LSTM) networks.
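
The idea of attending selectively to salient time steps can be illustrated with a generic softmax-weighted temporal pooling. The sketch below is not the TSAM architecture and does not implement the speaker-distribution loss, neither of which is specified in this abstract. The function name `temporal_attention_pool` and the per-frame `scores` input are hypothetical: in a trained model the scores would come from a learned attention module, whereas here they are passed in directly.

```python
import math


def temporal_attention_pool(features, scores):
    """Pool a sequence of feature vectors into one vector.

    features: list of T per-time-step feature vectors (lists of floats).
    scores:   list of T unnormalized saliency scores (hypothetical input;
              a real model would predict these from the features).
    Returns (weights, pooled), where weights sum to 1.
    """
    if not features or len(features) != len(scores):
        raise ValueError("features and scores must be non-empty and equal length")

    # Numerically stable softmax over time: subtract the max score first.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the feature vectors, one dimension at a time.
    dim = len(features[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, features))
        for d in range(dim)
    ]
    return weights, pooled
```

With equal scores this reduces to plain averaging over time; as one score grows relative to the others, the pooled vector approaches the features of that single time step.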