论文信息 - Informedia @ TRECVID 2011

Informedia @ TRECVID 2011

The Informedia group participated in three tasks this year, including: Multimedia Event Detection (MED), Semantic Indexing (SIN) and Surveillance Event Detection. Generally, all of these tasks consist of three main steps: extracting feature, training detector and fusing. In the feature extraction part, we extracted a lot of low-level features, high-level features and text features. Especially, we used the Spatial-Pyramid Matching technique to represent the low-level visual local features, such as SIFT and MoSIFT, which describe the location information of feature points. In the detector training part, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale unbalance classification problem. In the fusion part, to take the advantages from different features, we tried three different fusion methods: early fusion, late fusion and double fusion. Double fusion is a combination of early fusion and late fusion. The experimental results demonstrated that double fusion is consistently better, or at least comparable than early fusion and late fusion.

[1] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[2] Paul Over,et al. Evaluation campaigns and TRECVid , 2006, MIR '06.

[3] Ivan Laptev,et al. On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[4] Datong Chen,et al. Improving multimedia retrieval with a video OCR , 2008, Electronic Imaging.

[5] Mubarak Shah,et al. Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching , 2010, TRECVID.

[6] A. Waibel,et al. A one-pass decoder based on polymorphic linguistic context assignment , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[7] Cordelia Schmid,et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[8] Paul A. Viola,et al. Detecting Pedestrians Using Patterns of Motion and Appearance , 2005, International Journal of Computer Vision.

[9] Alexander G. Hauptmann,et al. MoSIFT: Recognizing Human Actions in Surveillance Videos , 2009 .

[10] Koen E. A. van de Sande,et al. Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Leo Breiman,et al. Bagging Predictors , 1996, Machine Learning.

[12] Harriet J. Nock,et al. Discriminative model fusion for semantic concept detection and annotation in video , 2003, ACM Multimedia.

[13] Yoram Singer,et al. Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[14] Xuelong Li,et al. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Chih-Jen Lin,et al. LIBSVM: A library for support vector machines , 2011, TIST.

[16] Mehryar Mohri,et al. L2 Regularization for Learning Kernels , 2009, UAI.

[17] Michael McGill,et al. Introduction to Modern Information Retrieval , 1983 .