MIC-TJU in MediaEval 2017 Emotional Impact of Movies Task

To predict the emotional impact and the fear induced by movies, we propose a framework that employs four audio-visual features. Specifically, we use features extracted by motion keypoint trajectories and by convolutional neural networks to describe the visual information, and we extract one global and one local audio feature to capture the audio cues. The feature vectors are combined with an early fusion strategy, and linear support vector regression and a support vector machine are then used to learn the affective models. The experimental results show that the combination of these features achieves promising performance.
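As a minimal sketch of this pipeline, the snippet below concatenates four pre-extracted per-clip feature vectors (early fusion) and fits a linear SVR for the continuous affect scores and a linear SVM for binary fear detection. It uses scikit-learn's liblinear-backed estimators as a stand-in; the feature names, dimensionalities, and synthetic data are illustrative assumptions, not the actual descriptors or settings used in the submission.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR, LinearSVC

# Hypothetical pre-extracted per-clip features (shapes are illustrative):
# two visual descriptors (trajectory-based and CNN-based) and two audio
# descriptors (one global, one local).
n_clips = 200
rng = np.random.default_rng(0)
feats = {
    "trajectory": rng.normal(size=(n_clips, 128)),
    "cnn":        rng.normal(size=(n_clips, 256)),
    "audio_glob": rng.normal(size=(n_clips, 64)),
    "audio_loc":  rng.normal(size=(n_clips, 64)),
}

# Early fusion: concatenate the four feature vectors for each clip.
X = np.hstack([feats[k] for k in ("trajectory", "cnn", "audio_glob", "audio_loc")])
X = StandardScaler().fit_transform(X)

y_valence = rng.uniform(-1.0, 1.0, size=n_clips)  # continuous affect score (synthetic)
y_fear    = rng.integers(0, 2, size=n_clips)      # binary fear label (synthetic)

# Linear SVR for the continuous valence/arousal subtask ...
svr = LinearSVR(C=1.0, max_iter=10000).fit(X, y_valence)
# ... and a linear SVM for the binary fear subtask.
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y_fear)

print(svr.predict(X[:3]), svm.predict(X[:3]))
```

In practice the fused vector would come from the real extractors (e.g., trajectory descriptors, CNN activations, and openSMILE-style audio statistics), with C tuned by cross-validation per subtask.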
