Saliency-based object recognition in video

In this paper we study object recognition in egocentric video recorded with body-worn cameras. This task has attracted much attention in recent years, since it has proven to be a key building block for action recognition systems in applications involving wearable cameras, such as tele-medicine or lifelogging. In these scenarios, an action can be effectively defined as a sequence of manipulated or observed objects, so object recognition becomes a critical stage of the system. Furthermore, video summarization of such content is also driven by the appearance of semantic objects in the camera's field of view. One particularity of first-person video is that it usually exhibits a strong distinction between active objects (manipulated or observed by the camera wearer) and passive objects (belonging to the background). In addition, spatial, temporal and geometric cues present in the video content may help identify the active elements in the scene. These saliency features relate to models of the Human Visual System, but also to the motor coordination of eye, hand and body movements. In this paper, we discuss the automatic generation of saliency maps in video and introduce a method that extends the well-known Bag-of-Words (BoW) paradigm with saliency information. We have assessed our proposal on several egocentric video datasets, demonstrating that it not only improves on the BoW baseline, but also matches the state-of-the-art performance of, e.g., part-based models, at noticeably lower computational cost. The approach also holds promise for other user-generated mobile content.
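The core idea of extending BoW with saliency can be sketched as follows: each local descriptor votes for its nearest codeword, and the vote is weighted by the saliency value at the descriptor's image position, so that features on active (salient) objects dominate the histogram. This is a minimal illustrative sketch; the function name, the hard-assignment quantization, and the simple multiplicative weighting are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def saliency_weighted_bow(descriptors, positions, codebook, saliency_map):
    """Build a saliency-weighted Bag-of-Words histogram.

    Each descriptor is assigned to its nearest codeword (hard assignment),
    and its vote is scaled by the saliency value at its (x, y) position.
    NOTE: hypothetical weighting scheme for illustration only.
    """
    k = codebook.shape[0]
    hist = np.zeros(k)
    for d, (x, y) in zip(descriptors, positions):
        word = np.argmin(np.linalg.norm(codebook - d, axis=1))  # nearest codeword
        hist[word] += saliency_map[y, x]                        # saliency-weighted vote
    total = hist.sum()
    return hist / total if total > 0 else hist                  # L1-normalize

# Toy example with random data standing in for SIFT/SURF descriptors.
rng = np.random.default_rng(0)
codebook = rng.random((5, 8))          # 5 codewords, 8-dim descriptors
descs = rng.random((20, 8))            # 20 local descriptors
pos = rng.integers(0, 10, (20, 2))     # (x, y) positions in a 10x10 frame
sal = rng.random((10, 10))             # per-pixel saliency map
h = saliency_weighted_bow(descs, pos, codebook, sal)
```

With a uniform saliency map this reduces to the standard BoW histogram, which is why the method can only improve on (never contradict) the plain BoW baseline when the saliency maps are informative.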
