TNO at TRECVID 2013: Multimedia Event Detection and Instance Search

We describe the TNO system and the evaluation results for TRECVID 2013 Multimedia Event Detection (MED) and instance search (INS) tasks. The MED system consists of a bag-of-word (BOW) approach with spatial tiling that uses low-level static and dynamic visual features, an audio feature and high-level concepts. Automatic speech recognition (ASR) and optical character recognition (OCR) are not used in the system. In the MED case with 100 example training videos, support-vector machines (SVM) are trained and fused to detect an event in the test set. In the case with 0 example videos, positive and negative concepts are extracted as keywords from the textual event description and events are detected with the high-level concepts. The MED results show that the SIFT keypoint descriptor is the one which contributes best to the results, fusion of multiple low-level features helps to improve the performance, and the textual event-description chain currently performs poorly. The TNO INS system presents a baseline open-source approach using standard SIFT keypoint detection and exhaustive matching. In order to speed up search times for queries a basic map-reduce scheme is presented to be used on a multi-node cluster. Our INS results show above-median results with acceptable search times.

[1]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Wessel Kraaij,et al.  Notebook paper: TNO instance search submission 2012 , 2011 .

[3]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Afshin Dehghan,et al.  SRI-Sarnoff AURORA System at TRECVID 2013 Multimedia Event Detection and Recounting , 2013, TRECVID.

[5]  A. Smeaton,et al.  TRECVID 2013 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics | NIST , 2011 .

[6]  C. Schmid,et al.  Description of Interest Regions with Center-Symmetric Local Binary Patterns , 2006, ICVGIP.

[7]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[8]  Zheng Fang,et al.  Comparison of different implementations of MFCC , 2001 .

[9]  Chenliang Xu,et al.  TRECVID 2012 GENIE: Multimedia Event Detection and Recounting , 2012, TRECVID.

[10]  Daniel P. W. Ellis,et al.  IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems , 2012, TRECVID.

[11]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Richard M. Stern,et al.  Informedia e-lamp @ TRECVID 2012 multimedia event detection and recounting MED and MER , 2012 .

[14]  Klamer Schutte,et al.  Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos , 2013, Machine Vision and Applications.

[15]  Dong Liu,et al.  BBNVISER : BBN VISER TRECVID 2012 Multimedia Event Detection and Multimedia Event Recounting Systems , 2012, TRECVID.

[16]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Koichi Shinoda,et al.  TokyoTechCanon at TRECVID 2012 , 2012, TRECVID.

[19]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[20]  Wessel Kraaij,et al.  GOOSE: semantic search on internet connected sensors , 2013, Defense, Security, and Sensing.

[21]  Matti Pietikäinen,et al.  Rotation-Invariant Image and Video Description With Local Binary Pattern Features , 2012, IEEE Transactions on Image Processing.

[22]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[23]  Dennis Koelma,et al.  The MediaMill TRECVID 2008 Semantic Video Search Engine , 2008, TRECVID.

[24]  Cordelia Schmid,et al.  AXES at TRECVID 2012: KIS, INS, and MED , 2012, TRECVID.

[25]  Henri Bouma,et al.  Recognition and localization of relevant human behavior in videos , 2013, Defense, Security, and Sensing.

[26]  Paul Over,et al.  Creating HAVIC: Heterogeneous Audio Visual Internet Collection , 2012, LREC.

[27]  Stéphane Ayache,et al.  Video Corpus Annotation Using Active Learning , 2008, ECIR.

[28]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[29]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Christopher D. Manning,et al.  Generating Typed Dependency Parses from Phrase Structure Parses , 2006, LREC.

[31]  Chong-Wah Ngo,et al.  Semantic Indexing and Multimedia Event Detection: ECNU at TRECVID 2012 , 2012, TRECVID.

[32]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[33]  Thomas Fang Zheng,et al.  Comparison of different implementations of MFCC , 2001, Journal of Computer Science and Technology.