Affective Video Content Analyses by Using Cross-Modal Embedding Learning Features

Most existing methods for affective video content analysis focus on a single modality, either visual or audio content, and few attempts have been made to analyze the two signals jointly. In this paper, we employ a cross-modal embedding learning approach to learn compact feature representations of the different modalities that are discriminative for analyzing the emotional attributes of a video. Specifically, we introduce inter-modal and intra-modal similarity constraints to guide the joint embedding learning procedure toward robust features. To capture cues at different granularities, global and local features are extracted from both the visual and audio signals, and a unified framework consisting of global and local feature embedding networks is built for affective video content analysis; a sketch of the embedding objective is given below. Experiments show that the proposed approach significantly outperforms state-of-the-art methods, demonstrating its effectiveness.
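The following is a minimal sketch, not the authors' released code, of how a joint visual-audio embedding with inter-modal and intra-modal similarity constraints could be trained. It assumes pre-extracted per-modality features, a PyTorch implementation, a triplet (margin) formulation of the constraints, and illustrative layer sizes and margin value.

```python
# Hypothetical sketch of cross-modal embedding learning with inter-/intra-modal constraints.
# All network sizes, the margin, and the triplet formulation are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """Projects pre-extracted features of one modality into a shared embedding space."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        # L2-normalize so that the dot product equals cosine similarity.
        return F.normalize(self.net(x), dim=-1)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should be closer to the anchor than the negative by a margin."""
    pos_sim = (anchor * positive).sum(dim=-1)
    neg_sim = (anchor * negative).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

visual_net = EmbeddingBranch(in_dim=2048)  # e.g. global visual features
audio_net = EmbeddingBranch(in_dim=128)    # e.g. global audio features

def joint_loss(v, a, v_neg, a_neg, v_pos, a_pos):
    """v, a: visual/audio features of the same clip; *_neg: features from a clip with a
    different emotion label; *_pos: features from another clip sharing the emotion label."""
    ev, ea = visual_net(v), audio_net(a)
    ev_neg, ea_neg = visual_net(v_neg), audio_net(a_neg)
    ev_pos, ea_pos = visual_net(v_pos), audio_net(a_pos)
    # Inter-modal constraint: matching visual/audio pairs closer than mismatched pairs.
    inter = triplet_loss(ev, ea, ea_neg) + triplet_loss(ea, ev, ev_neg)
    # Intra-modal constraint: clips with the same emotion label stay close within each modality.
    intra = triplet_loss(ev, ev_pos, ev_neg) + triplet_loss(ea, ea_pos, ea_neg)
    return inter + intra
```

In practice, separate branch pairs of this kind could be instantiated for the global and local features and their losses summed, which is one way to realize the unified framework described above.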
