Multi-modal learning for affective content analysis in movies

Affective content analysis is an important topic in video content analysis and has extensive applications in many fields. However, designing a computational model to predict the emotions induced by videos is challenging, because elicited emotions are relatively subjective. Intuitively, features from several modalities can characterize elicited emotions, but the correlations among these features and their individual influence remain poorly studied. To address this issue, we propose a multi-modal learning framework that classifies affective content in the valence-arousal space. Specifically, we depict the visual modality of elicited emotions with features extracted by motion keypoint trajectory and convolutional neural networks, and describe the audio modality with a global audio feature extracted by the openSMILE toolkit. A linear support vector machine and support vector regression are then employed to learn the affective models. Comparing these three features with five baseline features shows that all three are significant for describing affective content, and experimental results further demonstrate that they complement one another. Moreover, the proposed framework achieves state-of-the-art results on two challenging datasets for video affective content analysis.
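To make the learning stage concrete, the sketch below shows one plausible reading of the pipeline: a linear SVM for discrete valence classes and a linear SVR for continuous arousal scores, trained per modality and late-fused by averaging. This is not the authors' code; the feature matrices, dimensions, and the averaging fusion are illustrative assumptions standing in for the extracted motion keypoint trajectory, CNN, and openSMILE features.

```python
# Minimal sketch of the learning stage, assuming the three modality
# features are already extracted per clip. All data here is synthetic;
# feature dimensions and the late-fusion scheme are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, LinearSVR

rng = np.random.default_rng(0)
n_clips = 200
feats = {
    "trajectory": rng.normal(size=(n_clips, 256)),  # stand-in for encoded MKT features
    "cnn":        rng.normal(size=(n_clips, 128)),  # stand-in for pooled CNN activations
    "audio":      rng.normal(size=(n_clips, 64)),   # stand-in for openSMILE functionals
}
y_cls = rng.integers(0, 3, size=n_clips)      # discrete valence class (neg/neutral/pos)
y_reg = rng.uniform(-1.0, 1.0, size=n_clips)  # continuous arousal score

# Train one linear SVM / SVR per modality, then fuse by averaging
# their outputs (late fusion; the paper's exact scheme may differ).
cls_scores, reg_preds = [], []
for name, X in feats.items():
    svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    svc.fit(X, y_cls)
    cls_scores.append(svc.decision_function(X))  # shape (n_clips, n_classes)
    svr = make_pipeline(StandardScaler(), LinearSVR(C=1.0))
    svr.fit(X, y_reg)
    reg_preds.append(svr.predict(X))

fused_cls = np.mean(cls_scores, axis=0).argmax(axis=1)  # fused valence decision
fused_reg = np.mean(reg_preds, axis=0)                   # fused arousal estimate
print("training accuracy after fusion:", (fused_cls == y_cls).mean())
```

Averaging decision scores is only one of several reasonable fusion choices; early fusion (concatenating the three feature vectors before a single SVM/SVR) would be an equally simple baseline to compare against.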
