Multimodality and Deep Learning when Predicting Media Interestingness

This paper summarizes the computational models that Technicolor proposes to predict the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and derive some scientific perspectives for the task.
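To make the idea of combining semantic and multimodal features concrete, the sketch below shows one common way such a system can be structured: a late-fusion network that projects a visual embedding (e.g., from a pretrained CNN) and a semantic text embedding (e.g., word2vec) into a shared space before scoring interestingness. This is an illustrative sketch, not the authors' released model; the PyTorch framework choice, the `MultimodalInterestingness` name, and all layer sizes are assumptions.

```python
# Minimal late-fusion sketch (illustrative assumption, not the paper's model).
import torch
import torch.nn as nn

class MultimodalInterestingness(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=300, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fuse by concatenation, then regress a single interestingness score.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, visual_feat, text_feat):
        fused = torch.cat(
            [self.visual_proj(visual_feat), self.text_proj(text_feat)], dim=-1
        )
        return torch.sigmoid(self.head(fused))  # interestingness score in [0, 1]

# Usage with random stand-in features for a batch of 4 items.
model = MultimodalInterestingness()
scores = model(torch.randn(4, 2048), torch.randn(4, 300))
print(scores.shape)  # torch.Size([4, 1])
```

Late fusion via concatenation is only one design choice; joint visual-semantic embeddings trained with a ranking loss are a common alternative for this kind of task.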
