VIREO/DVMM at TRECVID 2009: High-Level Feature Extraction, Automatic Video Search, and Content-Based Copy Detection

This paper presents an overview and comparative analysis of our systems designed for three TRECVID 2009 tasks: high-level feature extraction, automatic search, and content-based copy detection.

High-Level Feature Extraction (HLFE): Our main focus for the HLFE task is the study of a new method named domain adaptive semantic diffusion (DASD) [1], which exploits semantic context (concept relationships) while also accounting for the domain shift of that context to improve concept detection accuracy. We apply our TRECVID 2008 HLFE system [2] to construct baseline detectors for the 20 evaluated concepts, where both local and global features are explored. Evaluation results show that our 2008 system is still able to produce strong performance (Run 5: MAP=0.156). Over the 20 strong baseline detectors, DASD consistently improves 17 concepts, using a set of 300+ relatively much weaker detectors (from VIREO-374 [3]) as context (Runs 1-4). Our six submitted runs are summarized below:

- A_vireo.dasd20scorelinear_1: DASD over a baseline using linear weighted fusion of local and global features; the concept affinity estimation method is the same as in Run 3.
- A_vireo.dasd20fcs_2: DASD over Run 5; using ground-truth annotations and Flickr context to estimate concept affinity.
- A_vireo.dasd20score_3: DASD over Run 5; using ground-truth annotations and detection scores to estimate concept affinity.
- A_vireo.dasd10_4: DASD over Run 5; using ground-truth annotations to estimate concept affinity (applied to only 10 concepts).
- A_vireo.localglobal_5: average fusion of local and global features.
- A_vireo.localalone_6: local features alone, with multiple keypoint detectors and spatial partitions.

Automatic Video Search: For this task, we have in the past focused on concept-based video search [4, 5]. Given a textual query, various factors including semantic relatedness, co-occurrence, diversity, and detector robustness were jointly considered for better selection of the concept detectors. This year, in addition to textual queries, the visual query examples are also taken into account, and our main focus is on the combination of multiple search modalities. To this end we apply a concept-driven fusion scheme, which is able to dynamically discover the (near-)optimal modality weights for each query (a sketch of such weighted fusion is given after the run list below). Evaluation results confirm the effectiveness of our fusion approach, which offers at least a 10% improvement over the best single-modality performance. Our ten submitted runs are summarized below:

- F_A_N_CityUHK1: multi-modality fusion of concept-based search (a slightly different setup based on Run 5), query-by-example (Run 9), and the text baseline (Run 10).
- F_A_N_CityUHK2: multi-modality fusion of concept-based search (Run 5), query-by-example (Run 9), and the text baseline (Run 10).
- F_A_N_CityUHK3: multi-modality fusion of concept-based search (Run 6), query-by-example (Run 9), and the text baseline (Run 10).
- F_A_N_CityUHK4: multi-modality fusion of concept-based search (Run 7), query-by-example (Run 9), and the text baseline (Run 10).
- F_A_N_CityUHK5: concept-based search; using both textual and visual example queries for concept selection.
- F_A_N_CityUHK6: concept-based search; using textual queries for concept selection based on semantic and context spaces [5].
- F_A_N_CityUHK7: concept-based search; using textual queries for domain adaptive concept selection based on Flickr context similarity.
- F_A_N_CityUHK8: concept-based search; using textual queries for concept selection based on Flickr context similarity.
- F_A_N_CityUHK9: query-by-visual-example.
- F_A_N_CityUHK10: text-based search.
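As context for the multi-modality runs above, the following is a minimal sketch of per-query weighted late fusion, assuming min-max score normalization; all function and variable names are illustrative, and the weight values shown are placeholders rather than the concept-driven weights our system derives for each query.

```python
# Minimal sketch (not the authors' code) of per-query weighted late fusion
# across search modalities. The paper's concept-driven scheme derives the
# per-query weights from the selected concepts; here they are hard-coded.
import numpy as np

def normalize(scores):
    """Min-max normalize a score vector so modalities are comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def fuse(modality_scores, weights):
    """Weighted late fusion: one score vector per modality, one weight each."""
    fused = np.zeros(modality_scores[0].shape)
    for s, w in zip(modality_scores, weights):
        fused += w * normalize(s)
    return fused

# Example: concept-based, query-by-example, and text scores for 1,000 shots.
concept, qbe, text = (np.random.rand(1000) for _ in range(3))
ranking = np.argsort(-fuse([concept, qbe, text], weights=[0.5, 0.3, 0.2]))
```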
Content-Based Video Copy Detection: Our approach for copy detection is mainly based on our recent work on near-duplicate keyframe detection [6]. We consider only two features: bag-of-visual-words (BoW) based on SIFT and bag-of-audio-words (BoA) based on MFCC. To achieve fast and accurate BoW-based detection, indexing and various geometric verification techniques are employed. We submitted two video-only runs and three audio+video runs (see their descriptions in Section 3).

1 High-Level Feature Extraction

In TRECVID 2009, we experiment with our recently proposed algorithm, named domain adaptive semantic diffusion (DASD) [1], for context-based concept fusion. Starting from hundreds of individually developed concept detectors, DASD exploits semantic context (concept relationships) to refine concept detection scores using a graph diffusion technique. In particular, it involves a semantic context adaptation process to cope with the domain change between training and test data. We adopt our 2008 HLFE system as the baseline. In the end, we find that the well-designed 2008 system, which utilizes both local and global features, still produces excellent performance (MAP=0.156), and that the DASD algorithm is capable of consistently improving this strong baseline for most of the evaluated concepts.

Figure 1: Our TRECVID 2009 local feature-based keyframe representation framework (vocabulary construction and BoW representation).

1.1 Baseline Detectors Using Local and Global Features

The bag-of-visual-words (BoW) representation derived from local keypoint features plays a very important role in a successful concept detection system. For this, we slightly update our 2008 BoW representation framework (see Figure 2 in [2]) by removing one keypoint detector and adding one more spatial partition. The new framework is shown in Figure 1. The MSER detector is dropped since it did not help much in TRECVID 2008, and, as using multiple spatial resolutions tends to be helpful, we add a 3 x 1 partition. In the end, four SVMs are trained for each concept using the BoW histograms. For more details about this BoW representation, please refer to [2, 7].

We extract two kinds of global features: grid-based color moments (CM) and grid-based wavelet texture (WT). For CM, we calculate the first 3 moments of the 3 channels in Lab color space over 5 x 5 grids, and aggregate the features into a 225-d feature vector. For WT, we use 3 x 3 grids, and each grid is represented by the variances in 9 Haar wavelet sub-bands to form an 81-d feature vector. Two SVMs are trained for each concept using the two global features respectively. Given a test keyframe (for both the HLFE and automatic search tasks, we extract three keyframes from each test shot), the SVM classifiers are applied to the same set of features for prediction.
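For concreteness, below is a minimal sketch of the grid-based color moment (CM) feature described above (first 3 moments x 3 Lab channels x 5 x 5 grids = 225 dimensions). The use of OpenCV and the signed cube-root formulation of the third moment are our assumptions, not details confirmed by the paper.

```python
# Minimal sketch of the 225-d grid color-moment descriptor: first 3 moments
# x 3 Lab channels x 5x5 grids. OpenCV is assumed for the Lab conversion.
import cv2
import numpy as np

def color_moments(image_bgr, grid=5):
    """225-d grid color-moment descriptor of a keyframe (BGR input)."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2Lab).astype(np.float64)
    h, w, _ = lab.shape
    feat = []
    for i in range(grid):
        for j in range(grid):
            cell = lab[i * h // grid:(i + 1) * h // grid,
                       j * w // grid:(j + 1) * w // grid]
            for c in range(3):  # L, a, b channels
                x = cell[:, :, c].ravel()
                mean, std = x.mean(), x.std()
                # Signed cube root of the third central moment, a common
                # formulation of the third color moment (an assumption here).
                m3 = ((x - mean) ** 3).mean()
                feat.extend([mean, std, np.sign(m3) * abs(m3) ** (1 / 3)])
    return np.asarray(feat)  # shape: (225,)
```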
Figure 2: Illustration of DASD using four example concepts (road, vehicle, water, sky). Over a set of test keyframes, detectors of frequently co-occurring concepts tend to produce highly correlated prediction scores (left: road and vehicle). We therefore model concept relationships in a graph structure where each node is a concept and the edge weight (line width) indicates concept affinity (right). Prediction scores of the individual concept detectors are then refined w.r.t. the concept affinities using a graph diffusion technique.

The raw outputs of the SVMs are converted into posterior probabilities (concept detection scores). We then combine the detection scores from the six SVMs in a late fusion manner, i.e., the final decision is made by fusing the outputs of multiple separately trained classifiers. In most of our experiments, average fusion is adopted to combine the different classifiers.

1.2 Domain Adaptive Semantic Diffusion (DASD)

Most video concept detection systems assign one or multiple concept labels to a test sample (keyframe), where the assignment is often done independently, without considering inter-concept relationships. Because concepts do not occur in isolation (e.g., smoke and explosion), increasing research attention has recently been paid to improving detection accuracy by learning from semantic context (inter-concept relationships). The learning of contextual knowledge, however, is often conducted offline on training data, resulting in the classical problem of over-fitting. For large-scale semantic concept detection, which can involve the simultaneous labeling of hundreds of concepts, the problem becomes worse when the unlabeled videos come from a domain different from that of the training data. For example, the concept weapon frequently co-occurs with desert in news videos owing to the extensive coverage of the Iraq war. When such a contextual relationship is captured using news videos as training data, misleading detection results will be generated if it is applied to documentary videos, where the relationship is seldom observed. This raises two challenges for scalable context-based learning: the need for adaptive learning and the demand for efficient detection. DASD is designed to tackle both challenges in a unified fashion.

As illustrated in Figure 2, one underlying assumption of DASD is that detectors of frequently co-occurring concepts should produce highly correlated scores. We therefore construct an undirected, weighted graph, namely the semantic graph, to model the concept affinities. The graph is then applied to refine the concept detection scores using a function-level diffusion process, with the aim of recovering the consistency of the detection scores w.r.t. the concept relationships. To handle the domain change problem, DASD further allows the detection results to be optimized while simultaneously adapting the geometry of the semantic graph (the concept affinities) to the test data distribution. More formally, the cost function of DASD is defined as:
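The equation itself does not survive in this copy of the paper. As a hedged reconstruction, the description above matches the standard normalized graph-regularization energy used for graph diffusion; the exact form should be verified against [1]:

```latex
% Reconstruction (the original equation is cut off here); verify against [1].
E(W, \mathbf{g}) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij}
  \left\| \frac{\mathbf{g}_i}{\sqrt{d_i}} - \frac{\mathbf{g}_j}{\sqrt{d_j}} \right\|^2,
\qquad d_i = \sum_{j=1}^{n} w_{ij},
```

where n is the number of concepts, g_i stacks the detection scores of concept i over all test keyframes, and w_ij is the affinity between concepts i and j. Minimizing E over g smooths scores among strongly related concepts (the diffusion step), while minimizing E over W adapts the graph to the test domain. A minimal sketch of the diffusion step under this reconstructed energy, with graph adaptation omitted for brevity:

```python
# Gradient descent on the normalized graph-regularization energy over the
# score matrix G (concepts x keyframes). A sketch of the diffusion step only;
# assumes every concept has nonzero total affinity (d > 0).
import numpy as np

def diffuse(G, W, steps=10, lr=0.2):
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # symmetrically normalized affinities
    for _ in range(steps):
        G = G - lr * (G - S @ G)      # dE/dG = (I - S) G, up to a constant
    return G
```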

References

[1] Stephen E. Robertson et al. Okapi/Keenbow at TREC-8, 1999, TREC.
[2] Milind R. Naphade et al. Learning the semantics of multimedia queries and concepts from a small number of examples, 2005, ACM MULTIMEDIA '05.
[3] Cordelia Schmid et al. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search, 2008, ECCV.
[4] Chong-Wah Ngo et al. Semantic context transfer across heterogeneous sources for domain adaptive video search, 2009, ACM Multimedia.
[5] Chong-Wah Ngo et al. Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection, 2009, IEEE Transactions on Image Processing.
[6] Chong-Wah Ngo et al. Columbia University/VIREO-CityU/IRIT TRECVID 2008 High-Level Feature Extraction and Interactive Video Search, 2008, TRECVID.
[7] Hung-Khoon Tan et al. Beyond Semantic Search: What You Observe May Not Be What You Think, 2008, TRECVID.
[8] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints, 2004, International Journal of Computer Vision.
[9] Paul M. B. Vitányi et al. The Google Similarity Distance, 2004, IEEE Transactions on Knowledge and Data Engineering.
[10] C. V. Ramamoorthy. Knowledge and Data Engineering, 1989, IEEE Transactions on Knowledge and Data Engineering.
[11] Chong-Wah Ngo et al. Domain adaptive semantic diffusion for large scale context-based video annotation, 2009, IEEE 12th International Conference on Computer Vision (ICCV).
[12] Chong-Wah Ngo et al. Towards optimal bag-of-features for object categorization and semantic video retrieval, 2007, CIVR '07.
[13] Chong-Wah Ngo et al. Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study, 2010, IEEE Transactions on Multimedia.
[14] Franciska de Jong et al. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition, 2007, SAMT.
[15] Chong-Wah Ngo et al. Fusing semantics, observability, reliability and diversity of concept detectors for video search, 2008, ACM Multimedia.