Zero-shot video retrieval using content and concepts

Recent research in video retrieval has succeeded at finding videos when the query includes tens or hundreds of relevant example videos for training supervised models. We instead investigate unsupervised zero-shot retrieval, where no training videos are provided: a query consists only of a text statement. For retrieval, we use text extracted from video frames, text recognized in the speech of the audio tracks, and automatically detected, semantically meaningful visual concepts identified in the videos with widely varying confidence. We introduce a new method for automatically identifying the concepts relevant to a text query within the Markov Random Field (MRF) retrieval framework, and we use source expansion to build rich textual representations of semantic video concepts from large external sources such as the web. We find that concept-based retrieval significantly outperforms text-based approaches in recall. In an evaluation derived from the TRECVID MED'11 track, we present early results showing that multi-modal fusion can compensate for the inadequacies of each individual modality, yielding substantial effectiveness gains. With relevance feedback, our approach provides additional improvements of over 50%.
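The fusion idea described above, combining evidence from on-screen text, speech, and visual concepts, can be sketched as weighted late fusion of per-modality ranked lists. This is a minimal illustrative sketch, not the paper's exact fusion method: the min-max normalization, the weights, and all identifiers are assumptions chosen for clarity.

```python
# Illustrative sketch of weighted late fusion across retrieval modalities.
# Normalization scheme and weights are assumptions, not the paper's method.

def minmax_normalize(scores):
    """Scale a {video_id: score} map into [0, 1]; a constant map becomes all 0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {vid: (s - lo) / span if span else 0.0 for vid, s in scores.items()}

def fuse(modality_scores, weights):
    """Weighted sum of normalized per-modality scores, ranked descending.

    A video missing from a modality simply contributes 0 for that modality,
    so one strong modality can compensate for another's failure.
    """
    fused = {}
    for name, scores in modality_scores.items():
        norm = minmax_normalize(scores)
        for vid, s in norm.items():
            fused[vid] = fused.get(vid, 0.0) + weights[name] * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores for three videos from three modalities:
# OCR text, speech transcripts (ASR), and visual-concept detection.
ocr = {"v1": 2.0, "v2": 0.5, "v3": 1.0}
asr = {"v1": 0.1, "v2": 3.0, "v3": 0.2}
concepts = {"v1": 0.9, "v2": 0.4, "v3": 0.8}

ranking = fuse({"ocr": ocr, "asr": asr, "concepts": concepts},
               {"ocr": 0.3, "asr": 0.3, "concepts": 0.4})
```

Here `v1` ranks first despite a weak ASR score because its OCR and concept evidence is strong, which is the compensation effect the abstract attributes to multi-modal fusion.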
