ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and Video2GIF for Video Interestingness

This paper presents the methods that underlie our submission to the Predicting Media Interestingness Task at MediaEval 2016. Our contribution relies on two main approaches: (i) a similarity metric between images and text and (ii) a generic video highlight detector. In particular, we develop a method for learning the similarity of text and images by projecting them into the same embedding space. This embedding allows us to find video frames that are both canonical and relevant with respect to the title of the video. We present the results of different configurations and give insights into when our best-performing method works well and where it has difficulties.
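To make the idea of a shared embedding space concrete, the following is a minimal sketch of how frames could be ranked against a title once both have been projected into the same space. It is not the submission's implementation: the abstract does not state the similarity function or the embedding models, so cosine similarity, the function names `cosine_similarity` and `rank_frames_by_title`, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length.

    Assumption: cosine similarity stands in for the learned text-image
    similarity; the abstract does not specify the actual metric.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_frames_by_title(
    title_vec: Sequence[float], frame_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Return frame indices sorted from most to least similar to the title."""
    scores = [cosine_similarity(title_vec, f) for f in frame_vecs]
    return sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical, hand-made embeddings; real ones would come from trained
# text and image encoders that map into the same space.
title = [1.0, 0.0, 0.0]
frames = [
    [0.0, 1.0, 0.0],  # unrelated to the title
    [0.9, 0.1, 0.0],  # closely aligned with the title
    [0.5, 0.5, 0.0],  # partially aligned
]
ranking = rank_frames_by_title(title, frames)
```

In this toy setup the frame whose embedding points in nearly the same direction as the title embedding is ranked first. How such a relevance score would be combined with the output of the highlight detector is not described in the abstract and is left out of the sketch.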
