Semantic Concept Detection Using Dense Codeword Motion

When detecting semantic concepts in video, much of the existing research in content-based classification uses keyframe information only. In particular, the combination of local features such as SIFT with the Bag of Words model is very popular among TRECVID participants. The few existing motion and spatiotemporal descriptors are computationally heavy and become impractical when applied to large datasets such as TRECVID. In this paper, we propose a way to efficiently combine positional motion, obtained from optic flow in the keyframe, with the information given by the Dense SIFT Bag of Words feature. The features we propose spatially bin motion vectors belonging to the same codeword into separate histograms describing movement direction (left, right, vertical, zero, etc.). Classifiers are mapped using the homogeneous kernel map technique for approximating the χ² kernel and then trained efficiently using a linear SVM. Using a simple linear fusion technique, we improve the Mean Average Precision of the Bag of Words DSIFT classifier on the TRECVID 2010 Semantic Indexing benchmark from 0.0924 to 0.0972, a statistically significant increase according to the standardized TRECVID randomization tests.
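The core feature construction described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes that each dense-SIFT keypoint already has an assigned codeword id and an optic-flow vector (dx, dy), and the direction bins and the zero-motion threshold are hypothetical choices.

```python
import numpy as np

# Coarse direction bins per codeword; labels and count are illustrative.
DIRECTIONS = ["zero", "left", "right", "up", "down"]

def motion_direction(dx, dy, zero_thresh=0.5):
    """Quantize one flow vector into a coarse direction bin index."""
    if dx * dx + dy * dy < zero_thresh ** 2:
        return 0  # near-zero motion
    if abs(dx) >= abs(dy):
        return 1 if dx < 0 else 2  # dominant horizontal: left / right
    return 3 if dy < 0 else 4      # dominant vertical (image y grows downward)

def codeword_motion_histograms(codewords, flows, vocab_size):
    """Bin flow vectors of keypoints sharing a codeword into direction
    histograms, yielding a (vocab_size x num_directions) feature."""
    hist = np.zeros((vocab_size, len(DIRECTIONS)))
    for cw, (dx, dy) in zip(codewords, flows):
        hist[cw, motion_direction(dx, dy)] += 1
    # L1-normalize each non-empty codeword row so the descriptor does not
    # depend on how many keypoints fell into that codeword.
    sums = hist.sum(axis=1, keepdims=True)
    np.divide(hist, sums, out=hist, where=sums > 0)
    return hist
```

The flattened histogram would then be passed through an explicit χ² feature map (e.g. an additive-kernel approximation) before linear SVM training, mirroring the pipeline in the abstract.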
