Retina enhanced SIFT descriptors for video indexing

This paper investigates how the detection of diverse high-level semantic concepts (objects, actions, scene types, persons, etc.) in videos can be improved by applying a model of the human retina. A large part of current approaches to Content-Based Image/Video Retrieval (CBIR/CBVR) relies on the Bag-of-Words (BoW) model, which has been shown to perform well, especially for object recognition in static images. Nevertheless, this state-of-the-art framework shows its limits when applied to videos because of the added temporal dimension. In this paper, we enhance a BoW model based on the classical SIFT local spatial descriptor by preprocessing videos with a model of the human retina. This retinal preprocessing makes the SIFT descriptor aware of temporal information. Our proposed descriptors extend the genericity of SIFT to spatio-temporal content, making them well suited to generic video indexing. They also benefit from the retina's spatio-temporal robustness to various disturbances such as noise, compression artifacts, luminance variations, and shadows. The proposed approaches are evaluated on the TRECVID 2012 Semantic Indexing task dataset.
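
To make the pipeline concrete, the sketch below shows how such a retina-enhanced BoW signature could be assembled from off-the-shelf tools: each frame is passed through OpenCV's bioinspired retina model, SIFT descriptors are extracted from the retina's parvocellular (detail) output, and the descriptors are quantized against a k-means codebook into a normalized histogram. This is a minimal illustration under stated assumptions (opencv-contrib-python for cv2.bioinspired, scikit-learn for clustering; the channel choice, codebook size, and frame sampling are illustrative), not the exact configuration used in the paper.

```python
# Sketch: retina-preprocessed SIFT + Bag-of-Words video signature.
# Assumes opencv-contrib-python (for cv2.bioinspired) and scikit-learn.
# Channel choice, codebook size, and sampling are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def retina_sift_descriptors(video_path, max_frames=100):
    """Run each frame through the retina model, then extract SIFT
    descriptors from the parvocellular (detail) output."""
    cap = cv2.VideoCapture(video_path)
    sift = cv2.SIFT_create()
    retina = None
    descriptors = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        if retina is None:
            h, w = frame.shape[:2]
            retina = cv2.bioinspired.Retina_create((w, h))
        retina.run(frame)          # retina keeps temporal state across frames
        parvo = retina.getParvo()  # detail channel; getMagno() gives motion
        gray = cv2.cvtColor(parvo, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    cap.release()
    return np.vstack(descriptors) if descriptors else np.empty((0, 128))

def bow_signature(descriptors, codebook):
    """Hard-assign each descriptor to its nearest visual word and
    return an L1-normalized histogram."""
    words = codebook.predict(descriptors.astype(np.float32))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Usage sketch: learn a codebook on training descriptors, then encode a video.
train_desc = retina_sift_descriptors("train_video.avi")
codebook = MiniBatchKMeans(n_clusters=1024, random_state=0).fit(train_desc)
signature = bow_signature(retina_sift_descriptors("test_video.avi"), codebook)
```

Because the retina model keeps internal temporal state, its parvo output at a given frame already integrates information from preceding frames, which is how an otherwise static descriptor such as SIFT can pick up temporal cues; the magnocellular channel (getMagno) could likewise be used to restrict sampling to moving regions.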
