Query and Keyframe Representations for Ad-hoc Video Search

This paper presents a fully automatic method that combines video concept detection and textual query analysis to address the problem of ad-hoc video search. We present a set of NLP steps that analyse different parts of the query in order to convert it into related semantic concepts; we propose a new method for transforming concept-based keyframe and query representations into a common semantic embedding space; and we show that combining the concept-based representations with their corresponding semantic embeddings results in improved video search accuracy. Experiments on the TRECVID AVS 2016 and Video Search 2008 datasets demonstrate the effectiveness of the proposed method compared to other similar approaches.
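The core retrieval idea in the abstract can be sketched as follows: concept-detector scores for a keyframe, and concepts extracted from the query by NLP analysis, are both projected into a shared word-embedding space, where similarity is computed. The sketch below is a minimal, hypothetical illustration assuming a ConSE-style convex combination of concept word vectors (it is not the paper's exact method); the toy concept vectors and scores are invented for demonstration.

```python
import numpy as np

def semantic_embedding(concept_scores, concept_vectors, top_k=3):
    """Map a vector of concept-detector scores into the word-embedding
    space as a score-weighted average of the top-k concepts' word
    vectors (ConSE-style; a stand-in for the paper's method)."""
    idx = np.argsort(concept_scores)[::-1][:top_k]   # top-k concepts
    weights = concept_scores[idx] / concept_scores[idx].sum()
    return weights @ concept_vectors[idx]            # convex combination

def cosine(a, b):
    """Cosine similarity used to rank keyframes against the query."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: 4 concepts, each with a 5-dim "word embedding".
rng = np.random.default_rng(0)
concept_vectors = rng.normal(size=(4, 5))

# Hypothetical detector outputs for one keyframe, and concept
# relevance scores produced by NLP analysis of the query.
keyframe_scores = np.array([0.9, 0.05, 0.8, 0.1])
query_scores    = np.array([0.8, 0.0, 0.7, 0.0])

kf_emb = semantic_embedding(keyframe_scores, concept_vectors)
q_emb  = semantic_embedding(query_scores, concept_vectors)
score = cosine(kf_emb, q_emb)  # rank keyframes by this similarity
```

In a real system the concept vectors would come from a pretrained word-embedding model (e.g. word2vec) and the scores from trained concept detectors; this similarity can then be fused with the direct concept-based matching score, in line with the combination the abstract describes.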
