Egocentric Visual Event Classification with Location-Based Priors

We present a method for visual classification of actions and events captured from an egocentric point of view. The method tackles the challenge of a moving camera by building deformable graph models for action classification. These models are learned from low-resolution, roughly stabilized difference images acquired with a single monocular camera. In parallel, raw images from the camera are used to estimate the user's location with a visual Simultaneous Localization and Mapping (SLAM) system. Action-location priors, learned from a labeled set of locations, further aid action classification and place events in context. We report results on a dataset collected in a cluttered environment, consisting of routine manipulations performed on untagged objects.
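
The fusion of appearance-based action scores with action-location priors can be illustrated as a simple Bayesian product, P(action | features, location) ∝ P(features | action) · P(action | location). The sketch below is a minimal illustration under assumed placeholders, not the paper's implementation: the action labels, location set, prior table, and score vector are hypothetical, and in the paper the likelihoods come from deformable graph models over difference images rather than the generic classifier output assumed here.

```python
import numpy as np

# Hypothetical action and location labels; the paper's actual label set is
# the routine object manipulations recorded in its dataset.
ACTIONS = ["pour_kettle", "open_fridge", "chop_vegetables"]
LOCATIONS = ["sink", "fridge", "counter"]

# P(action | location): an action-location prior table, learned from a
# labeled set of locations (rows: locations, columns: actions).
# Values here are invented for illustration.
action_given_location = np.array([
    [0.70, 0.05, 0.25],   # sink
    [0.10, 0.80, 0.10],   # fridge
    [0.25, 0.15, 0.60],   # counter
])

def classify_with_location_prior(action_scores, location_idx):
    """Fuse per-action likelihoods with a location-conditioned prior.

    action_scores: appearance-based likelihoods P(features | action)
    (in the paper, from the deformable graph action models);
    location_idx: index of the location estimate from the visual SLAM
    system. Returns the MAP action and the normalized posterior under
    P(a | x, l) proportional to P(x | a) * P(a | l).
    """
    posterior = action_scores * action_given_location[location_idx]
    posterior = posterior / posterior.sum()  # normalize for readability
    return ACTIONS[int(np.argmax(posterior))], posterior

# Toy example: appearance scores alone are ambiguous.
scores = np.array([0.40, 0.35, 0.25])  # hypothetical classifier output
best, post = classify_with_location_prior(scores, LOCATIONS.index("fridge"))
print(best, post)
```

In this toy example the appearance scores alone are nearly uniform, but conditioning on the fridge location makes "open_fridge" the clear MAP estimate, which is the sense in which location priors bring events into context.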
