BBNVISER: BBN VISER TRECVID 2012 Multimedia Event Detection and Multimedia Event Recounting Systems

We describe the Raytheon BBN Technologies (BBN) led VISER system for the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. We present a comprehensive analysis of the modules in our evaluation system, which include: (1) a large suite of visual, audio, and multimodal low-level features, (2) modules to detect semantic scene/action/object concepts over the entire video and within short temporal spans, (3) automatic speech recognition (ASR), and (4) videotext detection and recognition (OCR). For the low-level features, we used multiple static, motion, color, and audio features previously considered in the literature, as well as a set of novel, fast kernel-based feature descriptors recently developed by BBN. For the semantic concept detection systems, we leveraged BBN's natural language processing (NLP) technologies to automatically analyze short textual descriptions of videos and frames and identify salient concepts in them. We then trained detectors for these concepts using visual and audio features. The semantic-concept-based systems enable rich description of video content for event recounting (MER). Video-level concepts have the most coverage and provide robust concept detections on most videos. Segment-level concepts are less robust, but provide sequence information that enriches recounting. Object detection, ASR, and OCR outputs occur only sporadically, but they have high precision and improve the quality of the recounting. For the MED task, we combined these different streams using multiple early (feature-level) and late (score-level) fusion strategies. We present a rigorous analysis of each of these subsystems and of the impact of the different fusion strategies. In particular, we present a thorough study of semantic-feature-based systems compared with the low-level-feature-based systems used in most MED systems. Consistent with previous MED evaluations, low-level features exhibit strong performance. Further, the semantic-feature-based systems perform comparably to the low-level system and produce gains in fusion. Overall, BBN's primary submission has an average missed detection rate of 29.6% with a false alarm rate of 2.6%. One of BBN's contrastive runs has <50% missed detection and <4% false alarm rates for all twenty events.
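
As a rough illustration of score-level (late) fusion and of the two error rates quoted above, the following minimal sketch averages per-video detection scores from two streams and then computes the missed detection and false alarm rates at a fixed threshold. The abstract does not specify the fusion rule, weights, or threshold used in the actual system, so the weighted average, the stream names, the toy scores, and the function names `late_fusion` and `miss_and_false_alarm` are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def late_fusion(scores_by_stream: Dict[str, Sequence[float]],
                weights: Dict[str, float]) -> List[float]:
    """Score-level fusion as a weighted average of per-video scores.

    This is one simple fusion rule, assumed here for illustration;
    it is not claimed to be the rule used by the VISER system.
    """
    total_weight = sum(weights[name] for name in scores_by_stream)
    n_videos = len(next(iter(scores_by_stream.values())))
    fused = []
    for i in range(n_videos):
        weighted_sum = sum(weights[name] * scores[i]
                           for name, scores in scores_by_stream.items())
        fused.append(weighted_sum / total_weight)
    return fused


def miss_and_false_alarm(scores: Sequence[float],
                         labels: Sequence[int],
                         threshold: float) -> Tuple[float, float]:
    """Return (missed detection rate, false alarm rate) at a threshold.

    Missed detection rate: fraction of positive videos scored below threshold.
    False alarm rate: fraction of negative videos scored at or above threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    misses = sum(1 for s in positives if s < threshold)
    false_alarms = sum(1 for s in negatives if s >= threshold)
    return misses / len(positives), false_alarms / len(negatives)


# Toy scores for five videos from two hypothetical streams.
streams = {
    "low_level": [0.9, 0.2, 0.6, 0.1, 0.4],
    "semantic": [0.7, 0.4, 0.3, 0.2, 0.8],
}
weights = {"low_level": 0.5, "semantic": 0.5}  # assumed equal weights
labels = [1, 0, 1, 0, 0]  # 1 = video contains the event

fused = late_fusion(streams, weights)
p_miss, p_fa = miss_and_false_alarm(fused, labels, threshold=0.5)
print(f"missed detection rate = {p_miss:.3f}, false alarm rate = {p_fa:.3f}")
```

In this toy setting one of the two positive videos falls below the threshold and one of the three negatives rises above it; lowering the threshold would trade missed detections for false alarms.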
