Retina enhanced SIFT descriptors for video indexing

This paper investigates how the detection of diverse high-level semantic concepts (objects, actions, scene types, persons, etc.) in videos can be improved by applying a model of the human retina. A large part of current approaches to Content-Based Image/Video Retrieval (CBIR/CBVR) relies on the Bag-of-Words (BoW) model, which has been shown to perform well, especially for object recognition in static images. Nevertheless, this state-of-the-art framework shows its limits when applied to videos because of the added temporal dimension. In this paper, we enhance a BoW model based on the classical SIFT local spatial descriptor by preprocessing videos with a model of the human retina. This retinal preprocessing makes the SIFT descriptor aware of temporal information. Our proposed descriptors extend the genericity of SIFT to spatio-temporal content, making them well suited to generic video indexing. They also benefit from the retina's spatio-temporal robustness to various disturbances such as noise, compression artifacts, luminance variations, and shadows. The proposed approaches are evaluated on the TRECVID 2012 Semantic Indexing task dataset.
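
To make the pipeline concrete, the sketch below shows how such a retina-enhanced BoW signature could be assembled from off-the-shelf tools: each frame is passed through OpenCV's bioinspired retina model, SIFT descriptors are extracted from the retina's parvocellular (detail) output, and the descriptors are quantized against a k-means codebook into a normalized histogram. This is a minimal illustration under stated assumptions (opencv-contrib-python for cv2.bioinspired, scikit-learn for clustering; the channel choice, codebook size, and frame sampling are illustrative), not the exact configuration used in the paper.

```python
# Sketch: retina-preprocessed SIFT + Bag-of-Words video signature.
# Assumes opencv-contrib-python (for cv2.bioinspired) and scikit-learn.
# Channel choice, codebook size, and sampling are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def retina_sift_descriptors(video_path, max_frames=100):
    """Run each frame through the retina model, then extract SIFT
    descriptors from the parvocellular (detail) output."""
    cap = cv2.VideoCapture(video_path)
    sift = cv2.SIFT_create()
    retina = None
    descriptors = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        if retina is None:
            h, w = frame.shape[:2]
            retina = cv2.bioinspired.Retina_create((w, h))
        retina.run(frame)          # retina keeps temporal state across frames
        parvo = retina.getParvo()  # detail channel; getMagno() gives motion
        gray = cv2.cvtColor(parvo, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    cap.release()
    return np.vstack(descriptors) if descriptors else np.empty((0, 128))

def bow_signature(descriptors, codebook):
    """Hard-assign each descriptor to its nearest visual word and
    return an L1-normalized histogram."""
    words = codebook.predict(descriptors.astype(np.float32))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Usage sketch: learn a codebook on training descriptors, then encode a video.
train_desc = retina_sift_descriptors("train_video.avi")
codebook = MiniBatchKMeans(n_clusters=1024, random_state=0).fit(train_desc)
signature = bow_signature(retina_sift_descriptors("test_video.avi"), codebook)
```

Because the retina model keeps internal temporal state, its parvo output at a given frame already integrates information from preceding frames, which is how an otherwise static descriptor such as SIFT can pick up temporal cues; the magnocellular channel (getMagno) could likewise be used to restrict sampling to moving regions.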
