Egocentric activity recognition by leveraging multiple mid-level representations

Existing approaches to egocentric activity recognition mainly rely on a single modality (e.g., detecting the objects being interacted with) to infer the activity category. However, because the camera's viewing angle and the subject's visual field are not aligned, important objects may be partially occluded or missing from the video frames. Moreover, prior work usually ignores where the objects are and how the subject interacts with them. To address these difficulties, we propose multiple mid-level representations (e.g., objects manipulated by the user, background context, and motion patterns of the hands) that compensate for the insufficiency of a single modality and jointly capture what the subject interacts with, where the interaction takes place, and how it is performed. To evaluate the method, we introduce a new and challenging egocentric activity dataset (ADL+) that contains video and wrist-worn accelerometer data of people performing daily-life activities. Our approach significantly outperforms the state-of-the-art method in classification accuracy on both the public ADL dataset (36.8% to 46.7%) and our ADL+ dataset (32.1% to 60.0%). In addition, we conduct a series of analyses to explore the relative merits of each modality for egocentric activity recognition.
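To make the fusion idea concrete, below is a minimal sketch of combining the three mid-level representations (object, background-context, and hand-motion descriptors) by early fusion into a linear SVM. It assumes per-video feature vectors have already been extracted; all variable names, dimensions, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC


def fuse_features(object_feats, scene_feats, motion_feats):
    """Early fusion: L2-normalize each mid-level representation, then concatenate.

    object_feats: (N, D_obj) per-video object-occurrence scores ("what")
    scene_feats:  (N, D_scn) per-video background/scene descriptors ("where")
    motion_feats: (N, D_mot) per-video hand-motion descriptors ("how")
    """
    parts = [normalize(f) for f in (object_feats, scene_feats, motion_feats)]
    return np.hstack(parts)


# Hypothetical data: 120 videos, 10 activity classes (placeholder dimensions).
rng = np.random.default_rng(0)
X_obj = rng.random((120, 50))
X_scn = rng.random((120, 205))
X_mot = rng.random((120, 64))
y = rng.integers(0, 10, size=120)

X = fuse_features(X_obj, X_scn, X_mot)
clf = LinearSVC(C=1.0).fit(X[:100], y[:100])   # train on the first 100 videos
print("held-out accuracy:", clf.score(X[100:], y[100:]))
```

In practice one could also apply late fusion (training one classifier per modality and combining their scores), which makes the per-modality analysis mentioned above straightforward; the sketch uses simple concatenation only for brevity.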
