Local Invariant Feature Tracks for high-level video feature extraction

This paper builds upon previous work on local interest point detection and description to propose the extraction and representation of novel Local Invariant Feature Tracks (LIFT). These features compactly capture not only the spatial attributes of 2D local regions, as in SIFT and related techniques, but also their long-term trajectories in time. This and other desirable properties of LIFT allow the generation of Bag-of-Spatiotemporal-Words models that help capture the dynamics of video content, which is necessary for detecting high-level video features that, by definition, have a strong temporal dimension. Preliminary experimental evaluation and comparison of the proposed approach show promising results.
