Retina-enhanced SURF descriptors for semantic concept detection in videos

This paper investigates the potential benefit of modeling low-level human vision behaviors for high-level semantic concept detection. Many current approaches rely on the Bag-of-Words (BoW) model, which has proven to be a good choice, especially for object recognition in images. Extending it from static images to video sequences raises new problems, mainly how to use the added temporal dimension to detect the target concepts (swimming, drinking, ...). In this study, we propose to apply a human retina model to preprocess video sequences before running a state-of-the-art BoW analysis. This preprocessing, designed to enhance the appearance of static image elements in particular, increases performance by introducing robustness to traditional image and video problems such as luminance variation, shadows, compression artifacts and noise. These approaches are evaluated on the TRECVID 2010 Semantic Indexing task datasets, which contain 130 high-level semantic concepts. We use the well-known SURF descriptor as the entry point of the BoW system, but this work could be extended to any other local gradient-based descriptor.
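As a rough illustration of the BoW step mentioned above, the sketch below quantizes a set of local descriptors against a visual codebook and returns a normalized histogram. It is a minimal sketch under stated assumptions, not the system evaluated in the paper: the function name `bow_histogram`, the Euclidean nearest-codeword assignment, and the L1 normalization are all illustrative choices, and the toy 2-D vectors stand in for real SURF descriptors computed on retina-preprocessed frames.

```python
from math import dist


def bow_histogram(descriptors, codebook):
    """Return an L1-normalized histogram of nearest-codeword assignments.

    descriptors: sequence of equal-length numeric vectors (hypothetical
                 stand-ins for SURF descriptors).
    codebook:    sequence of codeword vectors, e.g. cluster centers
                 (assumed to be learned beforehand).
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Hard assignment: index of the closest codeword in Euclidean distance.
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    # Normalize so that frames with different descriptor counts are comparable.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy usage: two codewords, four descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (11.0, 10.0)]
histogram = bow_histogram(descriptors, codebook)
```

In a pipeline like the one described, a histogram of this kind would be computed per frame or per shot and passed to a classifier for each concept; the choice of classifier is outside the scope of this sketch.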
