IBM Research TRECVID-2010 Video Copy Detection and Multimedia Event Detection System

∗IBM T. J. Watson Research Center, Hawthorne, NY, USA; †Dept. of Computer Science, Columbia University; ‡College of Computing, Georgia Tech; §Dept. of Electrical Engineering, Duke University

In this paper, we describe the system jointly developed by IBM Research and Columbia University for video copy detection and multimedia event detection, applied to the TRECVID-2010 video retrieval benchmark.

A. Content-Based Copy Detection: The focus of our copy detection system this year was fusing three types of complementary fingerprints: a keyframe-based color correlogram, a SIFTogram (bag of visual words), and a GIST-based fingerprint. However, we did not use the color correlogram component in our official submissions, since our best results on the training set came from the GIST and SIFTogram components. A summary of our runs is listed below:

1. IBM.m.nofa.gistG: A run based on the grayscale GIST frame-level feature, with at most 1 result per query, except in the case of ties.
2. IBM.m.balanced.gistG: As in the above run, but including more results per query, though on average still fewer than 2.
3. IBM.m.nofa.gistGC: The result of the nofa.gistG run, fused with results from GIST features extracted from the R, G, B color channels.
4. IBM.m.nofa.gistGCsift: The result of the nofa.gistGC run, fused with a SIFTogram result.

Overall, the grayscale GIST approach performed best. We found it produced excellent results when tested on the TRECVID-2009 data set, with an optimal NDCR that surpassed what we had achieved with SIFTogram previously. The "gistG" runs also outperformed our other runs on the 2010 data, although we changed the SIFT implementation we used this year, so this year's SIFTogram results are not directly comparable with our previous TRECVID results. Our system did not make use of any audio features.

B. Multimedia Event Detection: Our MED system has three aspects to its design: a variety of global, local, and spatial-temporal descriptors; detectors built from a large-scale semantic basis; and temporal motif features. A summary of our runs is listed below:

1. IBM-CU 2010 MED EVAL cComboAll 1: Combination of all classifiers.
2. IBM-CU 2010 MED EVAL pComboIBM+CUHOF 1: Combination of global image features, spatial-temporal interest points, audio features, and model vector classifiers.
3. IBM-CU 2010 MED EVAL cComboStatic 1: Combination of global image features and model vector classifiers.
4. IBM-CU 2010 MED EVAL cComboDynamic 1: Combination of spatial-temporal interest points, audio features, temporal motif, and HMM classifiers.
5. IBM-CU 2010 MED EVAL cComboIBM+CUHOF 2: Combination of global image features, spatial-temporal interest points, audio features, and model vector classifiers.
6. IBM-CU 2010 MED EVAL cComboIBM-HOF 1: Combination of global image features, spatial-temporal HOG points, and model vector classifiers.
7. IBM-CU 2010 MED EVAL cComboIBM 1: Combination of global image features, spatial-temporal interest points, and model vector classifiers.
8. IBM-CU 2010 MED EVAL cmodelVectorAvg 1: Run with 272 semantic model vector features.
9. IBM-CU 2010 MED EVAL cTemporalMotifs 1: Semantic model vector feature with sequential motifs.
10. IBM-CU 2010 MED EVAL cmvxhmm 1: Semantic model vector feature with hierarchical HMM state histograms.

Overall, the semantic model vector is our best-performing single feature. The combination of dynamic features outperforms the combination of static features, and the temporal motif and hierarchical HMM features show promising performance.
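Several of the runs above are described as fusions or combinations of component results, and the "nofa" copy detection runs keep at most one result per query except for ties. The abstract does not state the fusion rule that was used, so the following is only a minimal sketch under assumptions: it averages min-max normalized scores (one common late-fusion choice, not necessarily the one used in our system) and then keeps every candidate that shares the top fused score. All function names, identifiers, and score values are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale {item: score} to [0, 1]; a constant score list maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def fuse_average(score_lists):
    """Assumed fusion rule: average the normalized scores across components.

    An item missing from a component contributes 0 for that component.
    """
    totals = defaultdict(float)
    for scores in score_lists:
        for item, s in min_max_normalize(scores).items():
            totals[item] += s
    n = len(score_lists)
    return {item: total / n for item, total in totals.items()}


def top_with_ties(scores):
    """Return every item sharing the maximum score (one item unless tied)."""
    if not scores:
        return []
    best = max(scores.values())
    return sorted(item for item, s in scores.items() if s == best)


# Hypothetical per-query scores from two components (values are made up).
gist_scores = {"ref_a": 0.9, "ref_b": 0.4, "ref_c": 0.1}
sift_scores = {"ref_a": 0.2, "ref_b": 0.8}

fused = fuse_average([gist_scores, sift_scores])
result = top_with_ties(fused)
print(result)
```

Averaging normalized scores treats every component equally; a weighted average with weights tuned on a training set would be a natural variant if one component is known to be more reliable.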
