Paper information - TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS

Oxford TRECVID 2006 - Notebook paper

The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, one using sparse and one using dense visual features, to learn classifiers for all 39 required concepts, using the training data supplied by MediaMill [29] for the 2005 data. In addition, we used a face-specific classifier, with features computed for specific facial parts, to facilitate answering people-dependent queries such as “government leader”.

We submitted 3 different runs for this task. OXVGG_A was the result of using the dense visual features only. OXVGG_OJ was the result of using the sparse visual features for all the concepts, except for “government leader”, “face” and “person”, where we prepended the results from the face classifier. OXVGG_AOJ was a run where we applied rank fusion to merge the outputs from the sparse and dense methods with weightings tuned to the training data, and also prepended the face results for “face”, “person” and “government leader”. In general, the sparse features tended to perform best on the more object-based concepts, such as “US flag”, while the dense features performed slightly better on more scene-based concepts, such as “military”. Overall, the fused run did the best with a Mean Average (inferred) Precision (MAP) of 0.093, the sparse run came second with a MAP of 0.080, followed by the dense run with a MAP of 0.053.

For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on the sparse and dense features, as well as a novel on-the-fly face classification system, which coupled a Google Images search with rapid Support Vector Machine (SVM) training and testing to return results containing a particular person within a few minutes. We submitted just one run, OXVGG_TVI, which performed well, winning two categories and coming above the median in 18 out of 24 queries.

1 High-level Feature Extraction

Our approach here is to train an SVM for the concept in question, then score all keyframes in the test set by the magnitude of their discriminant (the distance from the discriminating hyperplane), and subsequently rank the test shots by the score of their keyframes. We have developed three methods for this task, each differing in its features and/or kernel. Two of the methods are applicable to general visual categories (such as airplane, mountain and road) and the third is specific to faces. The first two methods differ in that one uses sparse (based on region detectors) monochrome features, and the other uses dense (on a regular pixel grid) colour features. We now describe the three methods in some detail.
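The shot ranking and rank fusion steps mentioned in the notebook text can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not confirm: a shot is scored by the maximum discriminant value over its keyframes, and fusion is a weighted Borda-style sum of rank positions. The function names `rank_shots` and `fuse_rankings` and the example weights are hypothetical, not taken from the Oxford system.

```python
def rank_shots(keyframe_scores):
    """Rank shots by their best keyframe score (assumption: max over keyframes).

    keyframe_scores maps a shot id to the list of SVM discriminant values
    obtained for that shot's keyframes.
    """
    shot_score = {shot: max(scores) for shot, scores in keyframe_scores.items()}
    return sorted(shot_score, key=shot_score.get, reverse=True)


def fuse_rankings(rankings, weights):
    """Weighted Borda-style rank fusion (an assumed, illustrative scheme).

    Each ranking is a list of shot ids, best first. A shot at position `pos`
    in a ranking of length n receives weight * (n - pos) points.
    """
    fused = {}
    for ranking, weight in zip(rankings, weights):
        n = len(ranking)
        for pos, shot in enumerate(ranking):
            fused[shot] = fused.get(shot, 0.0) + weight * (n - pos)
    # Highest fused score first; ties are broken by shot id for determinism.
    return sorted(fused, key=lambda shot: (-fused[shot], shot))


# Toy discriminant values for three shots under the two feature types.
sparse_ranking = rank_shots({"s1": [0.2, 1.5], "s2": [0.9], "s3": [-0.4, 0.1]})
dense_ranking = rank_shots({"s1": [-0.3], "s2": [0.1], "s3": [0.8]})

# Hypothetical weights; the text says the real ones were tuned on training data.
fused_ranking = fuse_rankings([sparse_ranking, dense_ranking], [0.7, 0.3])
```

With the weights reversed, the dense ranking dominates the fused order, which is the behaviour one would want when tuning the weighting per concept.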

TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS

Sheng Tang

Yongdong Zhang

Jintao Li

Ming Li

Xu Zhang

Na Cai

Li Tan

Kun Tao

Shao-Xi Xu

Yuanyuan Ran

Abstract: We participated in the high-level feature extraction task in TRECVID 2007. This paper describes our system for the task. For feature extraction, we propose an EMD-based bag-of-features method to exploit visual and spatial information, and use WordNet to expand the semantic meaning of text and thereby improve the generalization of the detectors. We also explore audio features and extract motion cues in the compressed domain for detecting concepts highly associated with audio or motion. We use the Ordered Weighted Average (OWA) fusion method to combine the SVM-based multi-modal concept detection results. Experimental results show that our methods are effective.
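As a rough illustration of the OWA fusion named in the abstract, the sketch below applies the standard OWA operator: the per-modality detector scores are sorted in descending order and combined with a weight vector that sums to one, so weights attach to rank positions rather than to particular modalities. The abstract does not say how the authors chose the weights, so the function name `owa` and the weight vectors here are assumptions for illustration only.

```python
def owa(scores, weights):
    """Ordered Weighted Average of a list of scores.

    The scores are sorted from largest to smallest, and weights[i] is applied
    to the i-th largest score. The weights must sum to 1.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


# Hypothetical detector outputs for one shot (e.g. visual, text, audio).
modal_scores = [0.2, 0.9, 0.5]

fused_max = owa(modal_scores, [1.0, 0.0, 0.0])    # behaves like max
fused_min = owa(modal_scores, [0.0, 0.0, 1.0])    # behaves like min
fused_mean = owa(modal_scores, [1 / 3, 1 / 3, 1 / 3])  # behaves like the mean
```

The three weight vectors show why OWA is a convenient fusion family: it interpolates between max, mean, and min by changing only the weights.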

References

[1] John F. Canny, et al. A Computational Approach to Edge Detection, 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Jin Zhao, et al. Video Retrieval Using High Level Features: Exploiting Query Matching and Confidence-Based Weighting, 2006, CIVR.

[3] Ahmed K. Elmagarmid, et al. InsightVideo: toward hierarchical video content organization for efficient browsing, summarization and retrieval, 2005, IEEE Transactions on Multimedia.

[4] John R. Smith, et al. Cluster-based data modeling for semantic video search, 2007, CIVR '07.

[5] Sheng Tang, et al. A density-based method for adaptive LDA model selection, 2009, Neurocomputing.

[6] J. Kacprzyk, et al. The Ordered Weighted Averaging Operators: Theory and Applications, 1997.

[7] John R. Smith, et al. IBM Research TRECVID-2009 Video Retrieval System, 2009, TRECVID.

[8] Cordelia Schmid, et al. Human Detection Based on a Probabilistic Assembly of Robust Part Detectors, 2004, ECCV.

[9] Neil J. Gordon, et al. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, 2002, IEEE Trans. Signal Process.

[10] Lei Zhang, et al. Canny edge detection enhancement by scale multiplication, 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] G. Clark, et al. Reference, 2008.

[12] King-Ip Lin, et al. The ANN-tree: an index for efficient approximate nearest neighbor search, 2001, Proceedings Seventh International Conference on Database Systems for Advanced Applications. DASFAA 2001.

[13] Rong Yan, et al. Filling the Semantic Gap in Video Retrieval: An Exploration, 2008.

[14] Rong Yan, et al. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News, 2007, IEEE Transactions on Multimedia.

[15] Gang Wang, et al. Exploring knowledge of sub-domain in a multi-resolution bootstrapping framework for concept detection in news video, 2008, ACM Multimedia.

[16] Trevor Darrell, et al. Efficient image matching with distributions of local invariant features, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[17] Jake K. Aggarwal, et al. A hierarchical Bayesian network for event recognition of human actions and interactions, 2004, Multimedia Systems.

[18] Rong Yan, et al. Semantic concept-based query expansion and re-ranking for multimedia retrieval, 2007, ACM Multimedia.

[19] Meng Wang, et al. MSRA-USTC-SJTU at TRECVID 2007: High-Level Feature Extraction and Search, 2007, TRECVID.

[20] Christiane Fellbaum, et al. Combining Local Context and Wordnet Similarity for Word Sense Identification, 1998.

[21] Milan Sonka, et al. Image Processing, Analysis and Machine Vision, 1993, Springer US.

[22] Boon-Lock Yeo, et al. On the extraction of DC sequence from MPEG compressed video, 1995, Proceedings., International Conference on Image Processing.

[23] Shih-Fu Chang, et al. CU-VIREO 374: Fusing Columbia 374 and VIREO 374 for Large Scale Semantic Concept Detection, 2008.

[24] Nikos Paragios, et al. Background modeling and subtraction of dynamic scenes, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[25] Li Chen, et al. Video copy detection: a comparative study, 2007, CIVR '07.

[26] Christopher G. Harris, et al. A Combined Corner and Edge Detector, 1988, Alvey Vision Conference.

[27] Takeo Kanade, et al. An Iterative Image Registration Technique with an Application to Stereo Vision, 1981, IJCAI.

[28] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[29] Larry S. Davis, et al. W4: Real-Time Surveillance of People and Their Activities, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[30] Andrew Zisserman, et al. Video Google: a text retrieval approach to object matching in videos, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[31] Christiane Fellbaum, et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[32] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[33] Yongdong Zhang, et al. Segregated feedback with performance-based adaptive sampling for interactive news video retrieval, 2007, ACM Multimedia.

[34] Richard Bowden, et al. Detection and Tracking of Humans by Probabilistic Body Part Assembly, 2005, BMVC.

[35] Lei Zhang, et al. Canny Edge Detection Enhancement by Scale Multiplication, 2005.

Cited by

Linguistic Patterns and Cross Modality-based Image Retrieval for Complex Queries

ICMR

2018

Ensemble Learning with LDA Topic Models for Visual Concept Detection

2012

Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification

IEEE Transactions on Multimedia

2012

Cross-Domain Concept Detection with Dictionary Coherence by Leveraging Web Images

MMM

2015

Semantic Video Annotation using Background Knowledge and Similarity-based Video Retrieval

TRECVID

2008

Group Sparse Ensemble Learning for Visual Concept Detection

PCM

2013

Sparse Ensemble Learning for Concept Detection

IEEE Transactions on Multimedia

2012

National Institute of Informatics, Japan at TRECVID 2008

TRECVID

2008

Beyond Semantic Search: What You Observe May Not Be What You Think

TRECVID

2008

Performance evaluation of early and late fusion methods for generic semantics indexing

Pattern Analysis and Applications

2013

TRECVID 2010 Known-item Search by NUS

TRECVID

2010

TRECVID 2007 High Level Feature Extraction experiments at JOANNEUM RESEARCH

TRECVID

2007

MovieBase: a movie database for event detection and behavioral analysis

WSMC '09

2009

Hierarchical BoW with segmental sparse coding for large scale image classification and retrieval

Multimedia Tools and Applications

2018

Web video retagging

Multimedia Tools and Applications

2011

MMM-TJU at TRECVID 2010

TRECVID

2010

THU and ICRC at TRECVID 2007

TRECVID

2007

Exploring large scale data for multimedia QA: an initial study

Large scale incremental web video categorization

FXPAL Interactive Search Experiments for TRECVID 2007
