Video Affective Effects Prediction with Multi-modal Fusion and Shot-Long Temporal Context

Predicting the emotional impact of videos with machine learning is challenging given the variety of modalities, the complicated temporal context of video, and the time dependency of emotional states. Feature extraction, multi-modal fusion, and temporal context fusion are crucial stages for predicting the valence and arousal values of emotional impact, but they have not yet been fully exploited. In this paper, we propose a comprehensive framework with novel designs of model structure and multi-modal fusion strategy. We select the most suitable modalities for the valence and arousal tasks respectively, and extract each modality's features with a modality-specific deep model pre-trained on a large generic dataset. Two temporal structures operating at different time scales, one intra-clip and the other inter-clip, are proposed to capture the temporal dependency of video content and emotional states. To combine the complementary information from multiple modalities, an effective and efficient residual-based progressive training strategy is proposed: each modality is incorporated into the multi-modal model step by step, with each newly added modality responsible for completing what the features of the previous modalities miss. With these improvements, our framework outperforms the state of the art on the LIRIS-ACCEDE dataset by a large margin.
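
Below is a minimal PyTorch sketch of the two-time-scale temporal structure and the residual-based progressive fusion described above, assuming GRU encoders and per-clip regression heads. All module names, dimensions, and the GRU choice are illustrative assumptions; the paper's exact architecture and training schedule may differ.

```python
# Hypothetical sketch of the two-time-scale encoder and residual-based
# progressive fusion; names and dimensions are assumptions, not the
# paper's exact design.
import torch
import torch.nn as nn

class TwoTimeScaleEncoder(nn.Module):
    """Intra-clip GRU over frame features, inter-clip GRU over clip embeddings."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.intra = nn.GRU(feat_dim, hid_dim, batch_first=True)  # within a clip
        self.inter = nn.GRU(hid_dim, hid_dim, batch_first=True)   # across clips

    def forward(self, x):
        # x: (batch, n_clips, n_frames, feat_dim)
        b, c, f, d = x.shape
        _, h = self.intra(x.reshape(b * c, f, d))   # last hidden state per clip
        clip_emb = h[-1].reshape(b, c, -1)          # (batch, n_clips, hid_dim)
        out, _ = self.inter(clip_emb)               # temporal context across clips
        return out                                  # (batch, n_clips, hid_dim)

class ResidualProgressiveFusion(nn.Module):
    """Each added modality regresses the residual left by the modalities before it."""
    def __init__(self, feat_dims, hid_dim=128):
        super().__init__()
        self.encoders = nn.ModuleList(TwoTimeScaleEncoder(d, hid_dim) for d in feat_dims)
        self.heads = nn.ModuleList(nn.Linear(hid_dim, 1) for _ in feat_dims)

    def forward(self, modalities):
        # modalities: list of tensors, each (batch, n_clips, n_frames, feat_dim_i)
        pred = 0.0
        for enc, head, x in zip(self.encoders, self.heads, modalities):
            pred = pred + head(enc(x)).squeeze(-1)  # this modality's correction
        return pred                                 # (batch, n_clips) valence or arousal

# Toy usage with made-up shapes: two modalities (e.g. visual, audio).
model = ResidualProgressiveFusion([2048, 128])
visual = torch.randn(2, 10, 16, 2048)               # (batch, clips, frames, dim)
audio = torch.randn(2, 10, 16, 128)
scores = model([visual, audio])                     # per-clip prediction
```

Under the progressive strategy sketched here, one would train the first encoder and head alone, then freeze them and fit each subsequent encoder-head pair to the remaining residual, so that every newly added modality only learns what the earlier modalities missed.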
