Time-Conditioned Action Anticipation in One Shot

The goal of human action anticipation is to predict future actions. Ideally, in real-world applications such as video surveillance and self-driving systems, future actions should be predicted not only with high accuracy but also at arbitrary and variable time horizons, ranging from short-term to long-term. Current work mostly focuses on predicting the next action, so long-term prediction is achieved by recursively predicting each next action, which is inefficient and accumulates errors. In this paper, we propose a novel time-conditioned method for efficient and effective long-term action anticipation. There are two key ingredients to our approach. First, explicitly conditioning our anticipation network on time allows it to anticipate long-term actions efficiently, in a single step rather than recursively. Second, we propose an attended temporal feature and a time-conditioned skip connection to extract relevant and useful information from observations for effective anticipation. We conduct extensive experiments on the large-scale EPIC-KITCHENS and 50Salads datasets. Experimental results show that the proposed method is capable of anticipating future actions at both short-term and long-term horizons, and that it achieves state-of-the-art performance.
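The abstract does not give the network architecture, so the following is only a minimal sketch of the general idea of one-shot, time-conditioned anticipation, not the authors' model. It assumes a toy, untrained network whose class name, weight matrices, and horizon encoding (`TimeConditionedAnticipator`, `w_att`, `w_hid`, `w_skip`, `w_out`, `encode_horizon`) are all hypothetical. The point it illustrates is that the requested horizon is an input, so a prediction 60 seconds ahead costs one forward pass instead of a chain of next-action predictions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def matvec(vec, mat):
    """Row vector times matrix, where mat is a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


class TimeConditionedAnticipator:
    """Toy, untrained model mapping (observed frames, horizon) to action scores."""

    def __init__(self, feat_dim, num_actions, seed=0):
        rng = random.Random(seed)
        in_dim = feat_dim + 2  # frame feature plus a 2-value horizon encoding

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Hypothetical parameters, randomly initialised for illustration only.
        self.w_att = rand(in_dim, 1)
        self.w_hid = rand(in_dim, feat_dim)
        self.w_skip = rand(in_dim, feat_dim)
        self.w_out = rand(feat_dim, num_actions)

    @staticmethod
    def encode_horizon(horizon_s):
        # Assumed encoding of the horizon in seconds; not taken from the paper.
        z = math.log1p(horizon_s)
        return [z, math.sin(z)]

    def predict(self, frames, horizon_s):
        t = self.encode_horizon(horizon_s)
        # Attended temporal feature: horizon-dependent weights over observed frames.
        scores = [matvec(f + t, self.w_att)[0] for f in frames]
        weights = softmax(scores)
        attended = [
            sum(w * f[d] for w, f in zip(weights, frames))
            for d in range(len(frames[0]))
        ]
        hidden = [math.tanh(h) for h in matvec(attended + t, self.w_hid)]
        # Time-conditioned skip connection, here taken from the most recent frame
        # (an assumption about where the skip originates).
        skip = matvec(frames[-1] + t, self.w_skip)
        logits = matvec([h + s for h, s in zip(hidden, skip)], self.w_out)
        return softmax(logits)


# Usage: one forward pass per requested horizon, with no recursion.
data_rng = random.Random(1)
frames = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
model = TimeConditionedAnticipator(feat_dim=8, num_actions=5)
short_term = model.predict(frames, horizon_s=1.0)
long_term = model.predict(frames, horizon_s=60.0)
```

Because the weights are random, the two distributions carry no meaning. The sketch shows only the interface: the same observations yield a different action distribution for each requested horizon, and each one is produced in a single step.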
