Filling the Semantic Gap in Video Retrieval: An Exploration

Digital images and motion video have proliferated in the past few years, ranging from ever-growing personal photo and video collections to professional news and documentary archives. In searching these archives, indexing based on low-level image features such as colour and texture, or on manually entered text annotations, often fails to meet the user's information need: there is a semantic gap produced by "the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation" (Smeulders, Worring, Santini, Gupta and Jain 2000). The image/video analysis community has long struggled to bridge this gap between low-level feature analysis (colour histograms, texture, shape) and semantic description of video content.

Early video retrieval systems (Lew 2002; Smith, Lin, Naphade, Natsev and Tseng 2002) typically modelled video clips with a set of low-level detectable features generated from different modalities. Such features can be extracted accurately and automatically: histograms in the HSV, RGB and YUV colour spaces, Gabor texture or wavelets, and structure through edge direction histograms and edge maps. However, because these features cannot express the semantic meaning of the video content, systems built on them had very limited success with semantic queries. Several studies have confirmed the difficulty of addressing real information needs with such low-level features (Markkula and Sormunen 2000; Rodden, Basalaj, Sinclair and Wood 2001).

One approach to overcoming this semantic gap is to utilise a set of intermediate textual descriptors that can be reliably applied to visual content, i.e. semantic concepts such as outdoors, faces and animals. Many researchers have accordingly been developing automatic classifiers for concepts related to people (face, anchor, etc.), acoustics (speech, music, significant pause) and objects (image blobs, buildings, graphics), among others.
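To make the two levels of description concrete, the following is a minimal sketch in Python (assuming OpenCV and NumPy are available; the function names, bin counts, concept labels and confidence values are illustrative assumptions, not details taken from any of the systems cited above). It first computes two of the low-level features mentioned, an HSV colour histogram and an edge direction histogram, and then ranks shots for a query via intermediate concept detector scores.

```python
# A minimal sketch contrasting low-level features with intermediate
# semantic concepts; all names and values here are illustrative.
import cv2
import numpy as np

def hsv_histogram(frame_bgr, bins=(8, 4, 4)):
    """Low-level feature: normalised HSV colour histogram of a keyframe."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def edge_direction_histogram(frame_bgr, bins=8):
    """Low-level feature: histogram of gradient (edge) directions."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1)
    angles = np.arctan2(gy, gx)  # per-pixel edge direction in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def concept_score(shot_concepts, query_concepts):
    """Intermediate level: score a shot by the weighted confidences of the
    concept detectors that the query maps onto."""
    return sum(w * shot_concepts.get(c, 0.0)
               for c, w in query_concepts.items())

# Low-level features on a placeholder keyframe (a real system would use
# keyframes decoded from the video).
frame = np.zeros((120, 160, 3), dtype=np.uint8)
colour_feat = hsv_histogram(frame)
edge_feat = edge_direction_histogram(frame)

# Hypothetical detector outputs per shot, and a query such as
# "outdoor shots of buildings" mapped onto two concepts.
shots = {
    "shot_01": {"outdoors": 0.91, "building": 0.72, "face": 0.10},
    "shot_02": {"outdoors": 0.20, "building": 0.15, "face": 0.95},
}
query = {"outdoors": 1.0, "building": 1.0}
ranked = sorted(shots, key=lambda s: concept_score(shots[s], query),
                reverse=True)
print(ranked)  # shot_01 ranks first
```

The point of the contrast: the histogram vectors say nothing about what a shot depicts, whereas even a handful of imperfect concept confidences lets a semantic query be answered directly.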

[1] Rong Yan et al., Probabilistic models for combining diverse knowledge sources in multimedia retrieval, 2006.

[2] Paul Over et al., TRECVID: Benchmarking the Effectiveness of Information Retrieval Tasks on Digital Video, 2003, CIVR.

[3] Christiane Fellbaum et al., WordNet: An Electronic Lexical Database, 1999, CL.

[4] David A. Forsyth et al., Matching Words and Pictures, 2003, J. Mach. Learn. Res.

[5] Alexander G. Hauptmann et al., Towards a Large Scale Concept Ontology for Broadcast Video, 2004, CIVR.

[6] Alexander G. Hauptmann et al., Successful approaches in the TREC video retrieval evaluations, 2004, MULTIMEDIA '04.

[7] Gang Wang et al., TRECVID 2004 Search and Feature Extraction Task by NUS PRIS, 2004, TRECVID.

[8] Marcel Worring et al., The challenge problem for automated detection of 101 semantic concepts in multimedia, 2006, MM '06.

[9] Rong Yan et al., The combination limit in multimedia retrieval, 2003, MULTIMEDIA '03.

[10] Brendan J. Frey et al., Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems, 1998, Proceedings of the International Conference on Image Processing (ICIP '98).

[11] Sara Shatford, Analyzing the Subject of a Picture: A Theoretical Approach, 1986.

[12] Shih-Fu Chang et al., Combining text and audio-visual features in video indexing, 2005, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05).

[13] Hans-Peter Frei et al., Concept based query expansion, 1993, SIGIR.

[14] Wei-Hao Lin et al., News video classification using SVM-based multimodal classifiers and combination strategies, 2002, MULTIMEDIA '02.

[15] Marcel Worring et al., The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] Rong Yan et al., Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News, 2007, IEEE Transactions on Multimedia.

[17] Yiming Yang et al., A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[18] Paul Over et al., TRECVID 2005 - An Overview, 2005, TRECVID.

[19] Arden Alexander et al., The Thesaurus for Graphic Materials: Its History, Use, and Future, 2001.

[20] R. Manmatha et al., Automatic image annotation and retrieval using cross-media relevance models, 2003, SIGIR.

[21] Dong Xu et al., Columbia University TRECVID-2006 Video Search and High-Level Feature Extraction, 2006, TRECVID.

[22] Rong Yan et al., Learning query-class dependent weights in automatic video retrieval, 2004, MULTIMEDIA '04.

[23] Paul Over et al., Evaluation campaigns and TRECVid, 2006, MIR '06.

[24] Douglas B. Lenat et al., Mapping Ontologies into Cyc, 2002.

[25] Ophir Frieder et al., Surrogate scoring for improved metasearch precision, 2005, SIGIR '05.

[26] Edward Y. Chang et al., Optimal multimodal fusion for multimedia data analysis, 2004, MULTIMEDIA '04.

[27] John R. Smith et al., On the detection of semantic concepts at TRECVID, 2004, MULTIMEDIA '04.

[28] Dan I. Moldovan et al., Exploiting ontologies for automatic image annotation, 2005, SIGIR '05.

[29] Jun Yang et al., Finding Person X: Correlating Names with Visual Appearances, 2004, CIVR.

[30] Arnold W. M. Smeulders et al., Content-Based Image Retrieval at the End of the Early Years, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[31] Thomas M. Cover et al., Elements of Information Theory, 2005.

[32] John R. Kender et al., Visual concepts for news story tracking: analyzing and exploiting the NIST TRECVID video annotation experiment, 2005, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05).

[33] Milind R. Naphade et al., Learning the semantics of multimedia queries and concepts from a small number of examples, 2005, MULTIMEDIA '05.

[34] Alexander G. Hauptmann et al., The Use and Utility of High-Level Semantic Features in Video Retrieval, 2005, CIVR.

[35] Jin Zhao et al., Video Retrieval Using High Level Features: Exploiting Query Matching and Confidence-Based Weighting, 2006, CIVR.

[36] Tobun Dorbin Ng et al., Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video, 2003, TRECVID.

[37] George Kingsley Zipf, Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology, 2012.

[38] Eero Sormunen et al., End-User Searching Challenges Indexing Practices in the Digital Newspaper Photo Archive, 2004, Information Retrieval.

[39] John R. Smith et al., Large-scale concept ontology for multimedia, 2006, IEEE MultiMedia.

[40] Azriel Rosenfeld et al., Face recognition: A literature survey, 2003, CSUR.

[41] Liang-Tien Chia et al., Does ontology help in image retrieval?: A comparison between keyword, text ontology and multi-modality ontology approaches, 2006, MM '06.

[42] Kerry Rodden et al., Does organisation by similarity assist image browsing?, 2001, CHI.

[43] Apostol Natsev et al., Exploring Automatic Query Refinement for Text-Based Video Retrieval, 2006, IEEE International Conference on Multimedia and Expo (ICME '06).

[44] John R. Smith et al., IBM Research TRECVID-2009 Video Retrieval System, 2009, TRECVID.