Egocentric Activity Prediction via Event Modulated Attention

Predicting future activities from an egocentric viewpoint is of particular interest in assisted living. However, most state-of-the-art egocentric activity understanding techniques are not suited to predictive tasks: their synchronous, frame-by-frame processing architectures perform poorly at modeling event dependencies and at pruning temporally redundant features. This work addresses these issues explicitly by proposing an asynchronous, gaze-event-driven attentive activity prediction network. The network is built on a gaze-event extraction module motivated by the observation that gaze moving onto or away from an object typically signals the start or end of an activity. The extracted gaze events are fed to (1) an asynchronous module that reasons about the temporal dependencies between events and (2) a synchronous module that softly attends to informative temporal durations for more compact and discriminative feature extraction. The two modules are seamlessly integrated for collaborative prediction. Extensive experiments on egocentric activity prediction, as well as recognition, demonstrate the effectiveness of the proposed method.
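
To make the gaze-event idea concrete, below is a minimal sketch of how enter/exit events could be extracted from a gaze track and per-frame object detections. The names (extract_gaze_events, Box) are hypothetical illustrations, not the paper's actual module, which may additionally smooth the gaze signal or use object masks rather than boxes.

```python
# Hypothetical gaze-event extraction: emit an event whenever the gaze
# point moves into or out of a detected object's bounding box.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Box:
    label: str
    x1: float; y1: float; x2: float; y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

def extract_gaze_events(gaze: List[Tuple[float, float]],
                        boxes_per_frame: List[List[Box]]):
    """Return (frame_idx, 'enter'/'exit', object_label) tuples marking
    when the gaze moves onto or away from an object."""
    events = []
    prev: Optional[str] = None  # object the gaze rested on in the previous frame
    for t, ((gx, gy), boxes) in enumerate(zip(gaze, boxes_per_frame)):
        cur = next((b.label for b in boxes if b.contains(gx, gy)), None)
        if cur != prev:
            if prev is not None:
                events.append((t, "exit", prev))   # gaze left prev: activity may end
            if cur is not None:
                events.append((t, "enter", cur))   # gaze entered cur: activity may start
            prev = cur
    return events
```

In this reading, the event stream is far sparser than the frame stream, which is what lets the downstream asynchronous module operate per event rather than per frame.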

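The asynchronous/synchronous split could be realized along the lines of the PyTorch sketch below: an LSTM consumes (event embedding, inter-event gap) pairs asynchronously, and its final state modulates soft attention over synchronous per-frame CNN features. This is a sketch under assumed interfaces (EventModulatedAttention, integer event types, precomputed frame features); the abstract does not specify the authors' exact architecture, dimensions, or fusion scheme.

```python
# Illustrative event-modulated attention; names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventModulatedAttention(nn.Module):
    def __init__(self, num_event_types: int, frame_dim: int,
                 event_dim: int = 64, hidden: int = 128, num_classes: int = 40):
        super().__init__()
        self.event_emb = nn.Embedding(num_event_types, event_dim)
        # Asynchronous branch: one LSTM step per *event*, with the
        # inter-event time gap appended so event timing is preserved.
        self.event_rnn = nn.LSTM(event_dim + 1, hidden, batch_first=True)
        # Synchronous branch: soft attention over per-frame features,
        # scored jointly with the event context.
        self.score = nn.Linear(frame_dim + hidden, 1)
        self.classifier = nn.Linear(frame_dim + hidden, num_classes)

    def forward(self, frame_feats, event_ids, event_gaps):
        # frame_feats: (B, T, frame_dim); event_ids, event_gaps: (B, E)
        ev = torch.cat([self.event_emb(event_ids),
                        event_gaps.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.event_rnn(ev)
        ctx = h[-1]                                  # (B, hidden) event context
        ctx_exp = ctx.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        att = F.softmax(
            self.score(torch.cat([frame_feats, ctx_exp], dim=-1)).squeeze(-1),
            dim=1)                                   # (B, T) soft attention weights
        pooled = (att.unsqueeze(-1) * frame_feats).sum(1)  # attended frame feature
        return self.classifier(torch.cat([pooled, ctx], dim=-1))
```

The design intent mirrored here is that event context both carries the dependency structure between activities and suppresses temporally redundant frames through the attention weights.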