Action Co-localization in an Untrimmed Video by Graph Neural Networks

We present an efficient approach for action co-localization in an untrimmed video that exploits contextual and temporal features from multiple action proposals. Most existing action localization methods treat each action instance individually, without accounting for the correlations among instances. To exploit such correlations, we propose the Graph-based Temporal Action Co-Localization (G-TACL) method, which aggregates contextual features from multiple action proposals to assist temporal localization. The aggregation is performed by a Graph Neural Network whose nodes are initialized with the action proposal representations. In addition, a multi-level consistency evaluator is proposed to measure the similarity between any two proposals; it summarizes low-level temporal coincidences, feature vector dot products, and high-level contextual feature similarities. The nodes are then iteratively updated with a Gated Recurrent Unit (GRU), and the resulting node features are used to regress the temporal boundaries of the action proposals and, finally, to localize the action instances. Experiments on the THUMOS’14 and MEXaction2 datasets demonstrate the efficacy of the proposed method.
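
The pipeline described in the abstract (proposals as graph nodes, pairwise consistency as edge weights, GRU-based node updates) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the equal weighting of the two similarity terms, the row normalization, and the use of cosine similarity as a normalized dot product are all assumptions made for illustration. The high-level contextual similarity term and the final boundary regression are omitted.

```python
import math

def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments: a low-level coincidence cue."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Normalized dot product of two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def build_adjacency(segments, feats, w_iou=0.5, w_feat=0.5):
    """Row-normalized edge weights from a hypothetical two-term consistency score."""
    n = len(segments)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i][j] = (w_iou * temporal_iou(segments[i], segments[j])
                             + w_feat * max(0.0, cosine(feats[i], feats[j])))
        total = sum(adj[i])
        if total > 0:
            adj[i] = [s / total for s in adj[i]]
    return adj

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_update(h, m, p):
    """One GRU step: h is the node state, m the aggregated message, p the weights."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], m), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], m), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], m), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def propagate(segments, feats, params, steps=2):
    """Initialize nodes with proposal features, then iterate message passing."""
    adj = build_adjacency(segments, feats)
    h = [list(f) for f in feats]
    n, d = len(h), len(h[0])
    for _ in range(steps):
        # Message to node i: consistency-weighted sum of its neighbours' states.
        msgs = [[sum(adj[i][j] * h[j][k] for j in range(n)) for k in range(d)]
                for i in range(n)]
        h = [gru_update(h[i], msgs[i], params) for i in range(n)]
    return h

# Toy example: three proposals with 2-D features and untrained (zero) weights.
segments = [(0.0, 10.0), (5.0, 15.0), (40.0, 50.0)]
feats = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {k: zeros for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
updated = propagate(segments, feats, params, steps=1)
```

In a trained model the weight matrices would be learned, and the updated node features would feed a regression head that refines each proposal's start and end times.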
