Paper Information - Activity Graph Transformer for Temporal Action Localization

Activity Graph Transformer for Temporal Action Localization

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts a set of action instances that appear in the video. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs. We evaluate our model on three challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
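To make the idea of reasoning over a video as a graph rather than a sequence more concrete, the following is a minimal sketch, not the authors' architecture. It assumes a video has already been split into segments with feature vectors, and it applies one round of dot-product attention restricted to graph neighbours. The function names (`softmax`, `graph_attention`), the toy features, and the edge set are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention(features, edges):
    """One round of dot-product attention restricted to graph neighbours.

    features: list of per-segment feature vectors (lists of floats).
    edges: set of (i, j) pairs meaning node i attends to node j.
           Every node also attends to itself.
    """
    n = len(features)
    dim = len(features[0])
    updated = []
    for i in range(n):
        neighbours = [j for j in range(n) if j == i or (i, j) in edges]
        # Scaled dot-product score between node i and each neighbour.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
            for j in neighbours
        ]
        weights = softmax(scores)
        # New feature for node i: attention-weighted average of neighbours.
        updated.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbours))
            for d in range(dim)
        ])
    return updated


# Four hypothetical video segments (assumed toy features).
# Segments 0 and 3 stand for the same action re-occurring later in the video.
segments = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
# Non-sequential edges: 0 <-> 3 links the re-occurrence, 1 <-> 2 links an overlap.
edges = {(0, 3), (3, 0), (1, 2), (2, 1)}
out = graph_attention(segments, edges)
```

In this toy setup, segment 0 exchanges information directly with segment 3 even though they are not temporally adjacent, which is the kind of non-sequential dependency a purely sequential model would have to carry across the intermediate segments.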

Megha Nawhal | Greg Mori
