Gaze-Informed Egocentric Action Recognition for Memory Aid Systems

Egocentric action recognition has been studied extensively in computer vision and clinical science, with applications in pervasive healthcare. Most existing egocentric action recognition techniques use features extracted either from entire video frames or from computed regions of interest (ROI) as inputs to action classifiers. The former can suffer from the moving backgrounds and irrelevant foregrounds typical of egocentric action videos, while the latter can be impaired by mismatches between the computed ROI and the ground-truth ROI. This paper proposes a new gaze-informed feature extraction approach, in which features are extracted from the regions around the wearer's gaze points, which represent the genuine ROI from a first-person point of view. Activities of daily living can then be classified using only the gaze-informed features extracted from these regions. The proposed approach has further been applied to a memory support system for people with memory impairments, such as those with amnesia or dementia, and for their carers. Experimental results demonstrate the efficacy of the proposed approach for egocentric action recognition, and hence the potential of the memory support tool in healthcare.
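
The central mechanism, cropping a patch around each recorded gaze point and computing a descriptor only within it, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's actual pipeline: the 128-pixel patch size, the border clamping, and the use of a HOG descriptor (via scikit-image) stand in for whatever parameters and features the authors chose.

```python
import numpy as np
from skimage.feature import hog  # stand-in descriptor; the paper's choice may differ


def gaze_roi_features(frame, gaze_xy, patch=128):
    """Compute a HOG descriptor on a square patch centred on the gaze point.

    frame   : 2-D (grayscale) NumPy array, assumed larger than `patch`
    gaze_xy : (x, y) gaze coordinates in pixel units
    patch   : side length of the gaze-centred ROI (illustrative value)
    """
    h, w = frame.shape
    half = patch // 2
    # Clamp the centre so the patch stays fully inside the frame near borders.
    cx = int(np.clip(gaze_xy[0], half, w - half))
    cy = int(np.clip(gaze_xy[1], half, h - half))
    roi = frame[cy - half:cy + half, cx - half:cx + half]
    # The descriptor is computed on the gaze-centred ROI only, not the whole frame.
    return hog(roi, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
```

In a full system, such per-frame descriptors would typically be aggregated over a video clip, for instance with a bag-of-visual-words or Fisher-vector encoding, before being passed to the action classifier.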
