This paper presents our approaches and results for the four TRECVID 2008 tasks we participated in: high-level feature extraction, automatic video search, video copy detection, and rushes summarization.

In high-level feature extraction, we jointly submitted our results with Columbia University. The four runs submitted through CityU aim to explore context-based concept fusion by modeling inter-concept relationships. The relationships are modeled not by semantic reasoning but by observing how concepts correlate with each other, either directly or indirectly, in the LSCOM common annotation [1]. An observability space (OS) [2] is thus built on top of LSCOM [1] and VIREO-374 [3] for performing concept fusion. Since 19 of the 20 concepts evaluated this year appear in VIREO-374, we apply the OS to re-rank the results of both the old models from VIREO-374 and the new models from a joint baseline submission with Columbia.

- A_CityU-HK1: re-rank A_CU-run5 using the OS; both positively and negatively correlated concepts are used.
- A_CityU-HK2: re-rank A_CU-run5 using the OS; only positively correlated concepts are used.
- A_CityU-HK3: re-rank the old VIREO-374 models using the OS; both positively and negatively correlated concepts are used.
- A_CityU-HK4: re-rank the old VIREO-374 models using the OS; only positively correlated concepts are used.

In automatic search, we focus on concept-based video search. The search goes beyond semantic reasoning: we consider the fusion of detectors based on concept semantics, co-occurrence, diversity, and detector robustness. Two of the submitted runs are based on the works in [2] and [4], respectively.

- F_A_2_CityUHK1_1: multi-modality fusion of concept-based search (Run-2), query-example-based search (Run-4 and Run-5), and the text baseline (Run-6).
- F_A_2_CityUHK2_2: concept-based search by fusing semantics, observability, reliability, and diversity of concept detectors [2].
- F_A_2_CityUHK3_3: concept-based search using semantic reasoning [4, 5].
- F_A_2_CityUHK4_4: query-by-example, using VIREO-374 detection scores as features.
- F_A_2_CityUHK5_5: query-by-example, using motion histograms as features.
- F_A_1_CityUHK6_6: text baseline.

In content-based video copy detection, we adopt a recently proposed near-duplicate video detection method [6, 7] based on the matching of local keypoint features. We submitted three runs (a minimal similarity sketch follows this list):

- CityUHK_loose: cosine similarity between visual-word histograms is used to generate the candidate near-duplicate keyframe set, which is further filtered by a recently proposed method called SR-PE [6].
- CityUHK_vkisect: same as CityUHK_loose, except that histogram intersection replaces cosine similarity for candidate keyframe set generation.
- CityUHK_tight: similar to CityUHK_loose, but with a few additional heuristic constraints.
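The sketch below illustrates, in a minimal form, the two histogram similarity measures used for candidate keyframe generation; it is an illustration rather than our actual implementation, and the function names and the 0.8 threshold are assumptions made for the example.

```python
import numpy as np

def cosine_similarity(h1, h2):
    # Cosine similarity between two visual-word histograms.
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    denom = np.linalg.norm(h1) * np.linalg.norm(h2)
    return float(np.dot(h1, h2) / denom) if denom > 0 else 0.0

def histogram_intersection(h1, h2):
    # Normalized histogram intersection: sum of bin-wise minima.
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    denom = min(h1.sum(), h2.sum())
    return float(np.minimum(h1, h2).sum() / denom) if denom > 0 else 0.0

def candidate_keyframe_pairs(query_hists, ref_hists, sim_fn, threshold=0.8):
    # Keep keyframe pairs whose visual-word histograms are similar enough;
    # surviving candidates are then verified by keypoint matching (SR-PE).
    candidates = []
    for i, hq in enumerate(query_hists):
        for j, hr in enumerate(ref_hists):
            score = sim_fn(hq, hr)
            if score >= threshold:
                candidates.append((i, j, score))
    return candidates
```

The two measures differ only in how a pair of histograms is scored, which is exactly the difference between the loose and vkisect runs.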
In BBC rushes summarization, we submitted one run using the same method as our last year's submission [8].

1 High-Level Feature Extraction (HLFE)

This year, we jointly submitted our HLFE results with Columbia University. Detailed descriptions of the joint submissions can be found in the Columbia notebook paper [9]. For the four runs submitted by CityU, we aim to test context-based concept fusion based on a linear space (the observability space) built from observations derived from manual concept annotation.

1.1 Concept Fusion with Observability Space

The observability space (OS) is proposed to effectively model the co-occurrence relationships among concepts [2]. We refine the individual concept detectors using a simple and efficient linear weighted fusion of each target concept with several peripherally related concepts, where both the concept selection and the fusion weights are determined by the OS. Given a concept set V of n concepts, we first construct an n×n concept observability matrix R, where each entry rij represents the co-occurrence relationship of a concept pair (Ci, Cj), measured by the Pearson product-moment (PM) correlation:

r_{ij} = \mathrm{PM}(C_i, C_j) = \frac{\sum_{k=1}^{|T|} (O_{ik} - \mu_i)(O_{jk} - \mu_j)}{(|T| - 1)\,\sigma_i \sigma_j},    (1)

where Oik is the observability of concept Ci in shot k, and μi and σi are the sample mean and standard deviation, respectively, of observing Ci in a training set T. We set Oik to 1 if Ci is present in shot k, and 0 otherwise. With R, the basis vectors C of the OS can be estimated by solving the following equation
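As a concrete illustration of Eq. (1), the sketch below computes the concept observability matrix R from a binary concept-shot annotation matrix and applies a generic linear weighted fusion. It is a minimal sketch under our own naming assumptions; the derivation of the fusion weights from the OS basis vectors is not reproduced here, and the weights are simply assumed to be given.

```python
import numpy as np

def observability_matrix(O):
    # Concept observability matrix R of Eq. (1).
    # O is a binary (n_concepts x n_shots) matrix over the training set T,
    # with O[i, k] = 1 if concept C_i is present in shot k and 0 otherwise.
    # R[i, j] is the Pearson product-moment correlation of the observability
    # vectors of concepts C_i and C_j (concepts that never occur in T have
    # zero variance and would need special handling).
    O = np.asarray(O, dtype=float)
    n_shots = O.shape[1]
    mu = O.mean(axis=1, keepdims=True)            # sample means
    sigma = O.std(axis=1, ddof=1, keepdims=True)  # sample standard deviations
    centered = O - mu
    return (centered @ centered.T) / ((n_shots - 1) * (sigma @ sigma.T))

def linear_weighted_fusion(scores, weights):
    # Refine a target concept by a weighted sum of detector score vectors.
    # scores: (n_concepts x n_shots) detection scores; weights: length-n
    # vector, including the target concept's own weight. In the actual
    # approach, concept selection and fusion weights are determined by the OS.
    return np.asarray(weights, dtype=float) @ np.asarray(scores, dtype=float)
```

For example, building O from the LSCOM common annotation and calling observability_matrix(O) yields the pairwise correlations from which the OS, and hence the selection of positively or negatively correlated concepts in the four HLFE runs, is derived.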
References

[1] S. E. Robertson et al. Okapi/Keenbow at TREC-8. In TREC, 1999.
[2] H.-K. Tan et al. Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search. In TRECVID, 2007.
[3] C.-W. Ngo et al. Fusing Semantics, Observability, Reliability and Diversity of Concept Detectors for Video Search. In ACM Multimedia, 2008.
[4] Y.-G. Jiang et al. VIREO-374: LSCOM Semantic Concept Detectors Using Local Keypoint Features. 2007.
[5] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
[6] S. Tang et al. TRECVID 2007 High-Level Feature Extraction by MCG-ICT-CAS. In TRECVID, 2007.
[7] C.-W. Ngo et al. Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection. IEEE Transactions on Image Processing, 2009.
[8] C.-W. Ngo et al. Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval. In CIVR, 2007.
[9] C.-W. Ngo et al. Ontology-Enriched Semantic Space for Video Search. In ACM Multimedia, 2007.
[10] C.-W. Ngo et al. Selection of Concept Detectors for Video Search by Ontology-Enriched Semantic Spaces. IEEE Transactions on Multimedia, 2008.
[11] C.-W. Ngo et al. Columbia University/VIREO-CityU/IRIT TRECVID 2008 High-Level Feature Extraction and Interactive Video Search. In TRECVID, 2008.
[12] J. R. Smith et al. Large-Scale Concept Ontology for Multimedia. IEEE MultiMedia, 2006.
[13] F. de Jong et al. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition. In SAMT, 2007.
[14] C.-W. Ngo et al. Rushes Video Summarization by Object and Event Understanding. In TVS '07, 2007.
[15] H.-K. Tan et al. Near-Duplicate Keyframe Identification with Interest Point Matching and Pattern Learning. IEEE Transactions on Multimedia, 2007.