Impact of novel sources on content-based image and video retrieval

The problem of content-based image and video retrieval with textual queries is often posed as one of visual concept classification, where classifiers for a set of predetermined visual concepts are trained on a set of manually annotated images. Such a formulation implicitly assumes that the training data have distributional characteristics similar to those of the data to be indexed. In this paper we demonstrate empirically that even within the relatively narrow domain of news videos collected from a variety of news programs and broadcasters, the assumption of distributional similarity of visual features does not hold across programs from different broadcasters. This mismatch manifests itself in considerable degradation of ranked retrieval performance on novel sources. We observe that concepts whose spatial locations remain relatively fixed across sources are more robust to source mismatches, and vice versa. We also show that a simple averaging of multiple visual detectors is more robust than any of the individual detectors. Furthermore, we show that for certain sources using only 20% of the available annotated data bridges roughly 80% of the performance drop, while other sources require larger amounts of annotated data.
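
The claim that averaging multiple visual detectors is more robust than any individual detector corresponds to a standard late-fusion scheme over per-shot concept scores. The Python sketch below is not the authors' implementation; the function name `fuse_and_rank`, the min-max normalization step, and the toy data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed, not the paper's code): late fusion by averaging
# the scores of several concept detectors, then ranking shots by the fused
# score. Detector outputs and shot ids here are hypothetical toy data.

def fuse_and_rank(detector_scores: np.ndarray, shot_ids: list[str]) -> list[tuple[str, float]]:
    """detector_scores: (num_detectors, num_shots) array of per-shot
    confidence scores from independently trained detectors for one concept."""
    # Normalize each detector's scores to [0, 1] so no single detector dominates.
    mins = detector_scores.min(axis=1, keepdims=True)
    maxs = detector_scores.max(axis=1, keepdims=True)
    normalized = (detector_scores - mins) / np.maximum(maxs - mins, 1e-12)

    # Simple unweighted average across detectors (the fusion discussed above).
    fused = normalized.mean(axis=0)

    # Rank shots by fused score, highest first.
    order = np.argsort(-fused)
    return [(shot_ids[i], float(fused[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((3, 5))              # 3 detectors, 5 shots (toy data)
    shots = [f"shot_{k}" for k in range(5)]
    for shot, score in fuse_and_rank(scores, shots):
        print(shot, round(score, 3))
```

The unweighted average needs no held-out data from the novel source to tune fusion weights, which is one plausible reason such a simple combination holds up better under source mismatch than any single detector.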
