Paper Information - BBN VISER TRECVID 2011 Multimedia Event Detection System

We describe the Raytheon BBN (BBN) VISER system, which is designed to detect events of interest in multimedia data, and present a comprehensive analysis of its modules in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and also used state-of-the-art spatio-temporal pooling strategies to model relationships between different features. The system also uses high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Furthermore, the VISER system exploits multimodal information by analyzing available spoken and videotext content using BBN's state-of-the-art Byblos automatic speech recognition (ASR) and video text recognition systems. These diverse streams of information are combined into a single, fixed-dimensional vector for each video. We explored two combination strategies: early fusion and late fusion. Early fusion was implemented through a fast kernel-based fusion framework, and late fusion was performed using both Bayesian model combination (BAYCOM) and an innovative weighted-average framework. Consistent with the previous MED'10 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, video text, and object detection provides consistent and significant performance improvements. Overall, BBN's VISER system exhibited the best performance among all the submitted systems, with an average ANDC score of 0.46 across the 10 MED'11 test events when the threshold was optimized for the NDC score, and a missed detection rate below 30% when the threshold was optimized to minimize missed detections at a 6% false alarm rate.
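The abstract names the two combination strategies but does not give their details. The following is a minimal sketch of the general distinction, not the VISER implementation. It assumes early fusion can be illustrated as a normalized weighted sum of per-feature kernel matrices, and late fusion as a normalized weighted average of per-system detection scores. The function names, the feature and system names, and the weights are all hypothetical; the paper's fast kernel-based framework, BAYCOM, and its weighted-average scheme may differ substantially.

```python
from typing import Dict, List

Matrix = List[List[float]]


def early_fusion_kernel(kernels: Dict[str, Matrix],
                        weights: Dict[str, float]) -> Matrix:
    """Fuse per-feature kernel matrices into one kernel (illustrative only).

    Each kernel is an n x n similarity matrix over the same n videos.
    The fused kernel is a weighted sum with weights normalized to sum to 1,
    so a single classifier can then be trained on the combined kernel.
    """
    total = sum(weights[name] for name in kernels)
    n = len(next(iter(kernels.values())))
    fused = [[0.0] * n for _ in range(n)]
    for name, kernel in kernels.items():
        w = weights[name] / total
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * kernel[i][j]
    return fused


def late_fusion_scores(scores: Dict[str, List[float]],
                       weights: Dict[str, float]) -> List[float]:
    """Fuse per-system detection scores for the same videos (illustrative only).

    Each system (e.g. visual, ASR) is trained separately and outputs one
    score per video; the fused score is a normalized weighted average.
    """
    total = sum(weights[name] for name in scores)
    n = len(next(iter(scores.values())))
    fused = [0.0] * n
    for name, system_scores in scores.items():
        w = weights[name] / total
        for i in range(n):
            fused[i] += w * system_scores[i]
    return fused


if __name__ == "__main__":
    # Hypothetical toy inputs: two videos, two feature streams.
    kernels = {
        "appearance": [[1.0, 0.2], [0.2, 1.0]],
        "audio": [[1.0, 0.6], [0.6, 1.0]],
    }
    print(early_fusion_kernel(kernels, {"appearance": 1.0, "audio": 1.0}))

    scores = {"visual": [0.9, 0.1], "asr": [0.5, 0.3]}
    print(late_fusion_scores(scores, {"visual": 3.0, "asr": 1.0}))
```

In this sketch, early fusion merges the information before any classifier is trained, whereas late fusion merges the outputs of classifiers that were trained independently on each stream. How the weights are chosen is left open here.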
