The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information.

For the high-level feature extraction task, we used two different approaches, both based on sparse visual features. One used a standard bag-of-words representation, while the other additionally used a lower-dimensional “topic”-based representation generated by Latent Dirichlet Allocation (LDA). For both methods, we trained χ²-based SVM classifiers for all high-level features using publicly available annotations [3]. In addition, for certain features, we took a more targeted approach. Features based on human actions, such as “Walking/Running” and “People Marching”, were answered by running a robust pedestrian detector on every frame, coupled with an action classifier targeted to each feature, to give high-precision results. For “Face” and “Person”, we used a real-time face detector and pedestrian detector, and for “Car” and “Truck”, we used a classifier which localized the vehicle in each image, trained on an external set of images of side and front views. We submitted 6 different runs. OXVGG_1 (0.073 mAP) was our best run; it used a fusion of our LDA and bag-of-words results for most features, but favored our specific methods for features where these were available. OXVGG_2 (0.062 mAP) and OXVGG_3 (0.060 mAP) were variations on this first run, using different parameter settings. OXVGG_4 (0.060 mAP) used LDA for all features and OXVGG_5 (0.059 mAP) used bag-of-words for all features. OXVGG_6 (0.066 mAP) was a variation of our first run. We came first in “Mountain” and were in the top five for “Studio”, “Car”, “Truck” and “Explosion/Fire”. Our main observation this year is that retrieval performance can be boosted by using tailored approaches for specific concepts.

For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on different image representations. The main differences between this year’s system and last year’s were the availability of many more expansion methods and a “temporal zoom” facility, which proved invaluable for answering the many action queries in this year’s task. We submitted just one run, I_C_2_VGG_I_1_1, which came second overall with an mAP of 0.328 and came first in 5 queries.

1 High-level Feature Extraction

For the high-level feature task, we used two generic methods which were run for all topics, and more specialized methods for particular topics. These results were then fused to create the final submission.

1.1 Generic Approaches

For the following approaches, we used a reduced subset of MPEG i-frames from each shot, found by clustering the i-frames within a shot. Our approach here was to train an SVM for the concept in question, then score all frames in the test set by their distance from the discriminating hyper-plane. We then ranked the test shots by the maximum score over the reduced i-frames. We developed two different methods for this task, which differ only in their representations: the first uses a standard bag-of-words representation, and the second concatenates this bag-of-words representation with a topic-based LDA representation.
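The ranking step above is simple enough to show in a few lines. The following is a minimal sketch, not the original implementation: `score_fn` stands in for whatever trained concept classifier produces per-frame scores (signed distances from the hyper-plane), and `shots` is an illustrative name for a mapping from shot id to the feature vectors of its reduced i-frames.

```python
# Sketch of ranking test shots by the maximum classifier score over their
# reduced i-frames. `score_fn` and `shots` are illustrative names only.
from typing import Callable
import numpy as np

def rank_shots(score_fn: Callable[[np.ndarray], np.ndarray],
               shots: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """shots maps a shot id to an (n_frames, n_dims) array of frame feature vectors."""
    scored = [(shot_id, float(score_fn(frames).max()))
              for shot_id, frames in shots.items()]
    # Higher score means higher confidence that the concept appears in the shot.
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```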
1.1.1 Bag of visual words representation

The first method uses a bag of (visual) words [29] representation for the frames, where positional relationships between features are ignored. This representation has proved successful for classifying images according to whether they contain visual categories (such as cars, horses, etc.) by training an SVM [10]. Here we use the kernel formulation proposed by [33].

Figure 1: An example of Hessian-Laplace regions used in the bag-of-words method. Left: original image; right: sparse detected regions overlaid as ellipses.

Features and bag of words representation. We used Hessian-Laplace (HL) [21] interest points coupled with a SIFT [20] descriptor. This combination of detection and description generates features which are approximately invariant to an affine transformation of the image; see Figure 1. These features are computed for all reduced i-frames. The “visual vocabulary” is then constructed by running unsupervised K-means clustering over both the training and test data; the K-means cluster centres define the visual words. We used a vocabulary size of K = 10,000 visual words. The SIFT features in each reduced i-frame are then assigned to the nearest cluster centre to give the visual word representation, and the number of occurrences of each visual word is recorded in a histogram. This histogram of visual words is the bag-of-visual-words model for that frame.

Topic-based representation. We use the Latent Dirichlet Allocation [5, 16] model to obtain a low-dimensional representation of the bag-of-visual-words feature vectors. Similar low-dimensional representations have been found useful in the context of unsupervised [26, 28] and supervised [6, 25] object and scene category recognition, and image retrieval [17, 27]. We pool together both TRECVid training and test data in the form of 10,000-dimensional bag-of-visual-words vectors and learn 20, 50, 100, 500 and 1,000 topic models. The models are fitted using the Gibbs sampler described in [16]. These representations are concatenated into a single feature vector, each one independently normalized, such that the bag-of-words and the individual topic representations are each given equal weight. This approach was found to work best using a validation set taken from the training data.

SVM classification. To predict whether a keyframe from the test set belongs to a concept, an SVM classifier is trained for each concept. Specifically, we use a kernel SVM with the χ² kernel K(p, q) = exp(−α χ²(p, q)), where χ²(p, q) is the χ² distance between the feature vectors p and q.
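For concreteness, the vocabulary construction and histogram quantisation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the SIFT descriptors are assumed to be extracted already, and scikit-learn's MiniBatchKMeans is used here for scalability where the paper simply states K-means.

```python
# Illustrative bag-of-visual-words construction: cluster pooled SIFT
# descriptors into a K-word vocabulary, then histogram each frame's
# descriptors over the vocabulary. Descriptor arrays are placeholders;
# the paper uses Hessian-Laplace interest points with SIFT descriptors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 10_000  # vocabulary size used in the paper

def build_vocabulary(all_descriptors: np.ndarray) -> MiniBatchKMeans:
    """all_descriptors: (n_features, 128) SIFT descriptors pooled from train + test frames."""
    return MiniBatchKMeans(n_clusters=K, batch_size=10_000).fit(all_descriptors)

def bow_histogram(vocab: MiniBatchKMeans, frame_descriptors: np.ndarray) -> np.ndarray:
    """Assign each descriptor to its nearest visual word and count occurrences per word."""
    words = vocab.predict(frame_descriptors)
    return np.bincount(words, minlength=K).astype(float)
```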
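The χ² kernel above has the exponential form implemented by scikit-learn's `chi2_kernel`, so a per-concept classifier of this type can be sketched with a precomputed kernel matrix. This is a hedged sketch, assuming non-negative (bag-of-words plus topic) feature vectors and binary concept labels; α corresponds to the `gamma` parameter, and its value is not specified here.

```python
# Sketch of training and applying a per-concept SVM with the kernel
# K(p, q) = exp(-alpha * chi2(p, q)), using a precomputed kernel matrix.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_concept_svm(X_train: np.ndarray, y_train: np.ndarray, alpha: float = 1.0) -> SVC:
    """X_train: non-negative (n_frames, n_dims) feature vectors; y_train: +/-1 concept labels."""
    K_train = chi2_kernel(X_train, gamma=alpha)          # exp(-alpha * chi2 distance)
    return SVC(kernel="precomputed").fit(K_train, y_train)

def score_frames(svm: SVC, X_train: np.ndarray, X_test: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Signed distance of each test frame from the hyper-plane, used to rank frames/shots."""
    K_test = chi2_kernel(X_test, X_train, gamma=alpha)    # kernel between test and training frames
    return svm.decision_function(K_test)
```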
References

[1] Andrew Zisserman, et al. "Hello! My name is... Buffy" -- Automatic Naming of Characters in TV Video. BMVC, 2006.
[2] Stéphane Ayache, et al. Evaluation of active learning strategies for video indexing. Signal Processing: Image Communication, 2007.
[3] Andrew Zisserman, et al. Scene Classification Via pLSA. ECCV, 2006.
[4] Luc Van Gool, et al. Modeling scenes with local descriptors and latent aspects. Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, 2005.
[5] Andrew Blake, et al. A sparse probabilistic learning algorithm for real-time tracking. Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003.
[6] Michael Isard, et al. Object retrieval with large vocabularies and fast spatial matching. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[7] Andrew Zisserman, et al. Oxford TRECVID 2006 - Notebook paper. TRECVID, 2006.
[8] Michael Isard, et al. General Theory. 1969.
[9] Cordelia Schmid, et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.
[10] Andrew Zisserman, et al. An Exemplar Model for Learning Object Classes. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[11] Rainer Lienhart, et al. Image retrieval on large-scale image databases. CIVR '07, 2007.
[12] Gabriela Csurka, et al. Visual categorization with bags of keypoints. ECCV 2004.
[13] Cordelia Schmid, et al. A Comparison of Affine Region Detectors. International Journal of Computer Vision, 2005.
[14] Mark Steyvers, et al. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 2004.
[15] Daniel P. Huttenlocher, et al. Pictorial Structures for Object Recognition. International Journal of Computer Vision, 2004.
[16] Andrew Zisserman, et al. Regression and classification approaches to eye localization in face images. 7th International Conference on Automatic Face and Gesture Recognition (FGR06), 2006.
[17] Luc Van Gool, et al. Real-time affine region tracking and coplanar grouping. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.
[18] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.
[19] Cordelia Schmid, et al. An Affine Invariant Interest Point Detector. ECCV, 2002.
[20] Thorsten Joachims, et al. Making large scale SVM learning practical. 1998.
[21] Bill Triggs, et al. Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
[22] Michael Isard, et al. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. 2007 IEEE 11th International Conference on Computer Vision, 2007.
[23] Alexei A. Efros, et al. Using Multiple Segmentations to Discover Objects and their Extent in Image Collections. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.
[24] Andrew Zisserman, et al. Who Are You? - Real-time Person Identification. BMVC, 2007.
[25] S. Lazebnik, et al. Local Features and Kernels for Classification of Texture and Object Categories: An In-Depth Study. 2005.
[26] Alexei A. Efros, et al. Discovering object categories in image collections. 2005.
[27] Andrew Zisserman, et al. Video Google: a text retrieval approach to object matching in videos. Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003.
[28] Javed A. Aslam, et al. Models for metasearch. SIGIR '01, 2001.