Time-Conditioned Action Anticipation in One Shot

The goal of human action anticipation is to predict future actions. Ideally, in real-world applications such as video surveillance and self-driving systems, future actions should be predicted not only with high accuracy but also at arbitrary and variable time horizons, ranging from short-term to long-term. Current work mostly focuses on predicting the next action, so long-term prediction is achieved by recursively predicting each next action, which is inefficient and accumulates errors. In this paper, we propose a novel time-conditioned method for efficient and effective long-term action anticipation. There are two key ingredients to our approach. First, explicitly conditioning our anticipation network on time allows it to anticipate long-term actions efficiently. Second, we propose an attended temporal feature and a time-conditioned skip connection to extract relevant and useful information from observations for effective anticipation. We conduct extensive experiments on the large-scale EPIC-KITCHENS and 50Salads datasets. Experimental results show that the proposed method can anticipate future actions at both short and long time horizons and achieves state-of-the-art performance.
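
The idea of conditioning the prediction on time can be illustrated with a toy sketch. It is not the network described in the paper: the sinusoidal `time_encoding`, the single linear layer, the random untrained weights, and all dimensions are assumptions chosen only to show that one forward pass per requested horizon replaces a chain of recursive next-action predictions.

```python
import math
import random

def time_encoding(horizon_s, dim=4):
    """Hypothetical sinusoidal encoding of the anticipation horizon in seconds."""
    return [math.sin(horizon_s / (10.0 ** (i / dim))) for i in range(dim)]

def linear(weights, x):
    """Plain matrix-vector product: one score per action class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over the class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def anticipate(observed_feature, horizon_s, weights):
    """Single forward pass: class probabilities for the action `horizon_s` ahead.

    The horizon is an input, so no intermediate actions have to be predicted.
    """
    x = observed_feature + time_encoding(horizon_s)  # concatenate feature and time
    return softmax(linear(weights, x))

# Toy setup (assumed sizes, untrained random weights).
random.seed(0)
num_actions, feat_dim, time_dim = 3, 5, 4
weights = [[random.uniform(-1, 1) for _ in range(feat_dim + time_dim)]
           for _ in range(num_actions)]
observed = [random.uniform(-1, 1) for _ in range(feat_dim)]

# The same observation is queried at a short, a medium, and a long horizon.
predictions = {h: anticipate(observed, h, weights) for h in (1.0, 5.0, 20.0)}
for horizon, probs in predictions.items():
    print(horizon, [round(p, 3) for p in probs])
```

Each horizon costs exactly one call to `anticipate`, whereas a recursive next-action predictor would need one call per intermediate action and would feed its own errors forward.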
