Deep appearance and motion learning for egocentric activity recognition

Abstract Egocentric activity recognition has recently generated great popularity in computer vision due to its widespread applications in egocentric video analysis. However, it poses new challenges comparing to the conventional third-person activity recognition tasks, which are caused by significant body shaking, varied lengths, and poor recoding quality, etc. To handle these challenges, in this paper, we propose deep appearance and motion learning (DAML) for egocentric activity recognition, which leverages the great strength of deep learning networks in feature learning. In contrast to hand-crafted visual features or pre-trained convolutional neural network (CNN) features with limited generality to new egocentric videos, the proposed DAML is built on the deep autoencoder (DAE), and directly extracts appearance and motion feature, the main cue of activities, from egocentric videos. The DAML takes advantages of the great effectiveness and efficiency of the DAE in unsupervised feature learning, which provides a new representation learning framework of egocentric videos. The learned appearance and motion features by the DAML are seamlessly fused to accomplish a rich informative egocentric activity representation which can be readily fed into any supervised learning models for activity recognition. Experimental results on two challenging benchmark datasets show that the DAML achieves high performance on both short- and long-term egocentric activity recognition tasks, which is comparable to or even better than the state-of-the-art counterparts.

[1]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[2]  Jingkuan Song,et al.  Real-time social media retrieval with spatial, temporal and social constraints , 2017, Neurocomputing.

[3]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[4]  Dit-Yan Yeung,et al.  Learning a Deep Compact Image Representation for Visual Tracking , 2013, NIPS.

[5]  Martial Hebert,et al.  Temporal segmentation and activity classification from first-person sensing , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[6]  Nicu Sebe,et al.  Optimized Graph Learning Using Partial Tags and Multiple Features for Image and Video Annotation , 2016, IEEE Transactions on Image Processing.

[7]  Ling Shao,et al.  Embedding Motion and Structure Features for Action Recognition , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[8]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Dacheng Tao,et al.  Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Shmuel Peleg,et al.  Egocentric Video Biometrics , 2014, ArXiv.

[12]  Stefan Lee,et al.  Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[13]  Peng Jin,et al.  Fast reference frame selection based on content similarity for low complexity HEVC encoder , 2016, J. Vis. Commun. Image Represent..

[14]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Fang Liu,et al.  Simple to Complex Transfer Learning for Action Recognition , 2016, IEEE Transactions on Image Processing.

[16]  Shmuel Peleg,et al.  Compact CNN for indexing egocentric videos , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[17]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[18]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[19]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[20]  Tieniu Tan,et al.  Feature Coding in Image Classification: A Comprehensive Study , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Zi Huang,et al.  Effective Multiple Feature Hashing for Large-Scale Near-Duplicate Video Retrieval , 2013, IEEE Transactions on Multimedia.

[22]  Bingbing Ni,et al.  Cascaded Interactional Targeting Network for Egocentric Video Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Jie Lin,et al.  Egocentric activity recognition with multimodal fisher vector , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Takahiro Okabe,et al.  Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.

[25]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[26]  Yoichi Sato,et al.  Recognizing Micro-Actions and Reactions from Paired Egocentric Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jing Liu,et al.  Robust Structured Subspace Learning for Data Representation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[29]  Qing Tian,et al.  Cross-heterogeneous-database age estimation through correlation representation learning , 2017, Neurocomputing.

[30]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[31]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[32]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Meng Wang,et al.  3D deep shape descriptor , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yang Wang,et al.  Discriminative Latent Models for Recognizing Contextual Group Activities , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Pierre Baldi,et al.  Autoencoders, Unsupervised Learning, and Deep Architectures , 2011, ICML Unsupervised and Transfer Learning.

[36]  Larry H. Matthies,et al.  Pooled motion features for first-person videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Dacheng Tao,et al.  Temporal Variance Analysis for Action Recognition , 2015, IEEE Transactions on Image Processing.

[38]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[39]  Meng Wang,et al.  Play and Rewind: Optimizing Binary Representations of Videos by Self-Supervised Temporal Hashing , 2016, ACM Multimedia.

[40]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[41]  Jinhui Tang,et al.  Weakly Supervised Deep Matrix Factorization for Social Image Understanding , 2017, IEEE Transactions on Image Processing.

[42]  Cees G. M. Snoek,et al.  University of Amsterdam at THUMOS Challenge 2014 , 2014 .

[43]  Matthias Rauterberg,et al.  The Evolution of First Person Vision Methods: A Survey , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[44]  Raja Giryes,et al.  Autoencoders , 2020, ArXiv.

[45]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[46]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Shmuel Peleg,et al.  Temporal Segmentation of Egocentric Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Bo Gao,et al.  A discriminative key pose sequence model for recognizing human interactions , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[49]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[50]  Nicu Sebe,et al.  Quantization-based hashing: a general framework for scalable image and video retrieval , 2018, Pattern Recognit..

[51]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[52]  Sangmin Oh,et al.  Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach , 2013, 2013 IEEE International Conference on Computer Vision.

[53]  Chen Yu,et al.  Viewpoint Integration for Hand-Based Recognition of Social Interactions from a First-Person View , 2015, ICMI.

[54]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[55]  Rasmus Berg Palm,et al.  Prediction as a candidate for learning deep hierarchical models of data , 2012 .