A Deep-Learning Based Model for Emotional Evaluation of Video Clips

Emotional evaluation of video clips is a difficult task because a clip contains not only stationary objects in the background but also dynamic objects in the foreground. In addition, many video analysis problems must be solved before emotion-related tasks can be properly addressed. Recently, however, the convolutional neural network (CNN)-based deep learning approach has opened new possibilities by solving the action recognition problem. Inspired by CNN-based action recognition technology, this paper tackles the emotional evaluation of video clips. We propose a deep learning model that captures video features and evaluates the emotion of a video clip on Thayer's 2D emotion space. In the model, a pre-trained convolutional 3D neural network (C3D) generates short-term spatiotemporal features of the video, a long short-term memory (LSTM) network accumulates these consecutive time-varying features to characterize long-term dynamic behaviors, and a multilayer perceptron (MLP) evaluates the emotion of the clip by regression on the emotion space. Because of the limited amount of labeled data, the C3D is used to extract diverse spatiotemporal features from various layers via transfer learning. The C3D pre-trained on the Sports-1M dataset, the LSTM, and the MLP for regression are trained in an end-to-end manner to fine-tune the C3D and to adjust the weights of the LSTM and the MLP-type emotion estimator. The proposed method achieves concordance correlation coefficient values of 0.6024 for valence and 0.6460 for arousal. We believe this emotional evaluation of video could easily be associated with appropriate music recommendation, once the music is evaluated in the same high-level emotion space.
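As a rough illustration of the described pipeline, the following PyTorch-style sketch wires a pre-trained C3D feature extractor into an LSTM and an MLP regression head that outputs (valence, arousal), and computes the concordance correlation coefficient used for evaluation. The C3D loader, the chosen feature layer, layer sizes, and the segment length are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class C3DLSTMRegressor(nn.Module):
    """Sketch of the C3D -> LSTM -> MLP emotion regressor (illustrative only)."""

    def __init__(self, c3d_feature_dim=4096, lstm_hidden=256):
        super().__init__()
        # Placeholder for the C3D backbone pre-trained on Sports-1M; its
        # intermediate activations (e.g., fc6) serve as short-term
        # spatiotemporal features and are fine-tuned end to end.
        self.c3d = load_pretrained_c3d()  # hypothetical loader, not a real API
        self.lstm = nn.LSTM(c3d_feature_dim, lstm_hidden, batch_first=True)
        # MLP regression head mapping the accumulated dynamics to
        # (valence, arousal) on Thayer's 2D emotion space.
        self.mlp = nn.Sequential(
            nn.Linear(lstm_hidden, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, clips):
        # clips: (batch, n_segments, C, T, H, W), e.g., 16-frame segments per clip
        b, s = clips.shape[:2]
        feats = self.c3d(clips.flatten(0, 1)).view(b, s, -1)
        _, (h_n, _) = self.lstm(feats)   # last hidden state summarizes the whole clip
        return self.mlp(h_n[-1])         # (batch, 2): valence and arousal estimates


def concordance_cc(pred, target):
    """Concordance correlation coefficient, the reported evaluation metric."""
    p_mean, t_mean = pred.mean(), target.mean()
    p_var, t_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - p_mean) * (target - t_mean)).mean()
    return 2 * cov / (p_var + t_var + (p_mean - t_mean) ** 2)
```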
