A Multi-Task Neural Approach for Emotion Attribution, Classification, and Summarization

Emotional content is a crucial ingredient of user-generated videos, but the sparsity of emotional expressions in such videos poses an obstacle to visual emotion analysis. In this paper, we propose a new neural approach, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), which addresses three related emotion analysis tasks, emotion recognition, emotion attribution, and emotion-oriented summarization, in a single integrated framework. BEAC-Net has two major constituents: an attribution network and a classification network. The attribution network extracts the main emotional segment on which classification should focus, mitigating the sparsity issue. The classification network processes both the extracted segment and the original video in a bi-stream architecture. We also contribute a new dataset for the emotion attribution task with human-annotated ground-truth labels for emotion segments. Experiments on two video datasets demonstrate the superior performance of the proposed framework and the complementary nature of the two classification streams.
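
To make the bi-stream idea concrete, below is a minimal sketch of how such a classification stage could be organized: one stream encodes the emotion segment selected by the attribution network, a second stream encodes the full video for context, and their representations are fused for emotion classification. The layer types, dimensions, and names (BiStreamClassifier, segment_stream, global_stream) are illustrative assumptions for exposition, not the architecture reported in the paper.

    import torch
    import torch.nn as nn

    class BiStreamClassifier(nn.Module):
        """Illustrative bi-stream emotion classifier (not the paper's exact model)."""

        def __init__(self, feat_dim=2048, hidden_dim=512, num_emotions=8):
            super().__init__()
            # Stream 1: encodes frame features from the segment picked by attribution.
            self.segment_stream = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            # Stream 2: encodes frame features from the whole video for global context.
            self.global_stream = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            # Fuse both streams, then predict an emotion label.
            self.classifier = nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_emotions),
            )

        def forward(self, segment_feats, video_feats):
            # segment_feats: (batch, seg_len, feat_dim) -- attributed emotion segment
            # video_feats:   (batch, vid_len, feat_dim) -- full video
            _, seg_h = self.segment_stream(segment_feats)
            _, vid_h = self.global_stream(video_feats)
            fused = torch.cat([seg_h[-1], vid_h[-1]], dim=-1)
            return self.classifier(fused)  # unnormalized emotion logits

The design choice worth noting is that the segment stream alone would discard context, while the global stream alone would dilute the sparse emotional evidence; concatenating the two lets the classifier draw on both, which is the complementarity the experiments examine.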
