Visual search for objects in a complex visual context: what we wish to see

In this work we propose a saliency-based psycho-visual weighting of the Bag of Visual Words (BoVW) representation for object recognition. The approach is designed to identify objects related to Instrumental Activities of Daily Living (IADL) in videos recorded by a wearable camera. These recordings give an egocentric point of view on the upcoming action, which is also characterized by a complex visual scene with several objects in the frame. The human visual system processes only the relevant data by focusing on areas of interest. Building on this idea, we propose a new approach that introduces saliency models to discard irrelevant information in the video frames. We therefore apply a visual saliency model to weight the image signature within the BoVW framework. Visual saliency is well suited to capturing spatio-temporal information related to the observer's attention on the video frame. We also propose an additional geometric saliency cue that models the anticipation phenomenon observed in subjects watching video content from the wearable camera. The findings show that discarding irrelevant features yields better performance than the baseline method, which considers the whole set of features in the image.
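To make the weighting idea concrete, below is a minimal sketch of how a saliency map can weight BoVW pooling, together with a simple geometric anticipation cue. The function names, the hard assignment of descriptors to visual words, the L1 normalization, and the Gaussian form of the geometric cue are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def saliency_weighted_bovw(descriptors, keypoints, saliency_map, vocabulary):
    """Pool local descriptors into a BoVW histogram, weighting each
    visual-word vote by the saliency at its keypoint location.

    descriptors  : (N, D) array of local features (e.g. SIFT)
    keypoints    : (N, 2) array of (x, y) pixel coordinates
    saliency_map : (H, W) array with values in [0, 1]
    vocabulary   : (K, D) array of visual-word centroids
    """
    # Hard-assign each descriptor to its nearest visual word
    # (an assumed assignment scheme; soft assignment is also common).
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :],
                           axis=2)
    words = dists.argmin(axis=1)

    # Each vote counts proportionally to the saliency at its keypoint,
    # so features in non-salient regions are effectively discarded.
    xs = keypoints[:, 0].astype(int)
    ys = keypoints[:, 1].astype(int)
    weights = saliency_map[ys, xs]

    hist = np.zeros(len(vocabulary))
    np.add.at(hist, words, weights)

    # L1-normalize so signatures from different frames are comparable.
    total = hist.sum()
    return hist / total if total > 0 else hist

def geometric_anticipation_cue(h, w, motion, sigma=0.2):
    """Hypothetical geometric cue: a 2D Gaussian shifted from the frame
    centre along the estimated camera motion (dx, dy), modelling the
    observer anticipating the upcoming action."""
    cx, cy = w / 2 + motion[0], h / 2 + motion[1]
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-(((xs - cx) / (sigma * w)) ** 2
                 + ((ys - cy) / (sigma * h)) ** 2) / 2)
    return g / g.max()
```

In this sketch the geometric cue could be fused with a bottom-up saliency map (e.g. by pointwise multiplication) before being passed to `saliency_weighted_bovw`; the fusion rule is likewise an assumption for illustration.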
