Weakly-Supervised Completion Moment Detection using Temporal Attention

Monitoring the progression of an action towards completion offers fine-grained insight into the actor's behaviour. In this work, we target detecting the completion moment of actions, that is, the moment when the action's goal has been successfully accomplished. This has potential applications ranging from surveillance to assistive living and human-robot interaction. Previous efforts required human annotations of the completion moment for training (i.e. full supervision). Instead, we present an approach for moment detection from weak video-level labels. Given both complete and incomplete sequences of the same action, we learn temporal attention along with an accumulated completion prediction from all frames in the sequence. We also demonstrate how the approach can be used when completion-moment supervision is available. We evaluate and compare our approach on actions from three datasets, namely HMDB, UCF101 and RGBD-AC, and show that temporal attention improves detection in both weakly-supervised and fully-supervised settings.
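
To make the weakly-supervised setup concrete, the sketch below shows one way temporal attention can weight per-frame completion scores and accumulate them into a single video-level prediction trained only from complete/incomplete video labels. This is a minimal illustration, not the authors' released code: the module names, feature dimension and loss choice are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TemporalAttentionCompletion(nn.Module):
    """Illustrative sketch: per-frame completion logits are weighted by
    learned temporal attention and accumulated into one video-level
    prediction, trainable from weak (complete / incomplete) labels."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        # Scores how relevant each frame is to the completion decision.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        # Per-frame completion logit.
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (T, feat_dim) features for one video of T frames.
        attn = torch.softmax(self.attention(frame_feats), dim=0)   # (T, 1)
        frame_logits = self.classifier(frame_feats)                 # (T, 1)
        # Accumulate frame-level evidence into a video-level logit.
        video_logit = (attn * frame_logits).sum(dim=0)              # (1,)
        return video_logit, attn.squeeze(-1), frame_logits.squeeze(-1)

# Weakly-supervised training step with a video-level label
# (1 = the action is completed in the video, 0 = it never completes).
model = TemporalAttentionCompletion()
criterion = nn.BCEWithLogitsLoss()
feats = torch.randn(40, 1024)       # placeholder per-frame features
label = torch.tensor([1.0])         # video-level "complete" label
video_logit, attn, frame_logits = model(feats)
loss = criterion(video_logit, label)
loss.backward()
```

At test time, the learned attention weights and per-frame scores are the kind of signals from which a completion moment could then be localised within the sequence.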
