Video Affective Effects Prediction with Multi-modal Fusion and Shot-Long Temporal Context

Predicting the emotional impact of videos with machine learning is challenging given the variety of modalities, the complicated temporal context of video, and the time dependency of emotional states. Feature extraction, multi-modal fusion, and temporal context fusion are crucial stages for predicting the valence and arousal values of emotional impact, but they have not yet been fully exploited. In this paper, we propose a comprehensive framework with novel designs of model structure and multi-modal fusion strategy. We select the most suitable modalities for the valence and arousal tasks respectively, and extract each modality's features with a modality-specific deep model pre-trained on a large generic dataset. Two temporal structures operating at different time scales, one intra-clip and the other inter-clip, are proposed to capture the temporal dependency of video content and emotional states. To combine the complementary information from multiple modalities, an effective and efficient residual-based progressive training strategy is proposed: each modality is incorporated into the multi-modal model step by step, with each newly added modality responsible for completing what the features of the previous modalities miss. With these improvements, our framework outperforms the state of the art on the LIRIS-ACCEDE dataset by a large margin.
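
Below is a minimal PyTorch sketch of the two-time-scale temporal structure and the residual-based progressive fusion described above, assuming GRU encoders and per-clip regression heads. All module names, dimensions, and the GRU choice are illustrative assumptions; the paper's exact architecture and training schedule may differ.

```python
# Hypothetical sketch of the two-time-scale encoder and residual-based
# progressive fusion; names and dimensions are assumptions, not the
# paper's exact design.
import torch
import torch.nn as nn

class TwoTimeScaleEncoder(nn.Module):
    """Intra-clip GRU over frame features, inter-clip GRU over clip embeddings."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.intra = nn.GRU(feat_dim, hid_dim, batch_first=True)  # within a clip
        self.inter = nn.GRU(hid_dim, hid_dim, batch_first=True)   # across clips

    def forward(self, x):
        # x: (batch, n_clips, n_frames, feat_dim)
        b, c, f, d = x.shape
        _, h = self.intra(x.reshape(b * c, f, d))   # last hidden state per clip
        clip_emb = h[-1].reshape(b, c, -1)          # (batch, n_clips, hid_dim)
        out, _ = self.inter(clip_emb)               # temporal context across clips
        return out                                  # (batch, n_clips, hid_dim)

class ResidualProgressiveFusion(nn.Module):
    """Each added modality regresses the residual left by the modalities before it."""
    def __init__(self, feat_dims, hid_dim=128):
        super().__init__()
        self.encoders = nn.ModuleList(TwoTimeScaleEncoder(d, hid_dim) for d in feat_dims)
        self.heads = nn.ModuleList(nn.Linear(hid_dim, 1) for _ in feat_dims)

    def forward(self, modalities):
        # modalities: list of tensors, each (batch, n_clips, n_frames, feat_dim_i)
        pred = 0.0
        for enc, head, x in zip(self.encoders, self.heads, modalities):
            pred = pred + head(enc(x)).squeeze(-1)  # this modality's correction
        return pred                                 # (batch, n_clips) valence or arousal

# Toy usage with made-up shapes: two modalities (e.g. visual, audio).
model = ResidualProgressiveFusion([2048, 128])
visual = torch.randn(2, 10, 16, 2048)               # (batch, clips, frames, dim)
audio = torch.randn(2, 10, 16, 128)
scores = model([visual, audio])                     # per-clip prediction
```

Under the progressive strategy sketched here, one would train the first encoder and head alone, then freeze them and fit each subsequent encoder-head pair to the remaining residual, so that every newly added modality only learns what the earlier modalities missed.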
