Visual and Audio Aware Bi-Modal Video Emotion Recognition

With the rapid growth in the volume of online video, analyzing and predicting the affective impact that video content has on viewers has attracted considerable attention in the community. To address this challenge, several different kinds of information about video clips have been exploited. Traditional methods typically focused on a single modality, either audio or visual. Later, some researchers built multi-modal schemes, devoting substantial effort to selecting and extracting features and combining them under different fusion strategies. In this work, we propose an end-to-end model that automatically extracts features and performs emotion classification by integrating audio and visual features while also modelling the temporal characteristics of the video. An experimental study on the widely used MediaEval 2015 Affective Impact of Movies task demonstrates the method's potential, and we hope this work offers some insight for future video emotion recognition from a feature-fusion perspective.
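To make the described pipeline concrete, the following is a minimal sketch of the kind of bi-modal architecture the abstract outlines: per-frame visual features from a small CNN, frame-level audio features, concatenation fusion, and an LSTM for temporal modelling. The layer sizes, the small CNN encoder, the concatenation fusion, the 40-dimensional audio descriptors, and the three-way output (matching the task's positive/neutral/negative valence labels) are all illustrative assumptions, not the authors' exact configuration.

```python
# A minimal PyTorch sketch of a bi-modal (audio + visual) emotion
# classifier with temporal modelling. This is an assumed illustration,
# not the paper's exact model; all dimensions are placeholders.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Per-frame visual feature extractor (a stand-in for a deeper CNN)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):              # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))


class BiModalEmotionNet(nn.Module):
    """Fuses per-frame visual features with time-aligned audio features
    (e.g. log-mel bands) and models the sequence with an LSTM."""
    def __init__(self, audio_dim=40, vis_dim=128, hidden=256, n_classes=3):
        super().__init__()
        self.visual = FrameEncoder(vis_dim)
        self.audio_fc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.lstm = nn.LSTM(vis_dim + 64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, frames, audio):
        # frames: (B, T, 3, H, W); audio: (B, T, audio_dim)
        B, T = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio_fc(audio)
        fused = torch.cat([v, a], dim=-1)   # simple concatenation fusion
        _, (h, _) = self.lstm(fused)        # h: (num_layers, B, hidden)
        return self.head(h[-1])             # clip-level emotion logits


# Shape check with random tensors (8-frame clips, 64x64 frames).
model = BiModalEmotionNet()
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 40))
print(logits.shape)  # torch.Size([2, 3])
```

Concatenation is the simplest fusion choice consistent with the abstract; the final LSTM hidden state summarizes the clip's temporal dynamics before classification.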
