Graph Regularization Network with Semantic Affinity for Weakly-Supervised Temporal Action Localization

This paper presents a novel deep architecture for weakly-supervised temporal action localization that not only generates segment-level action responses but also propagates segment-level responses to the neighborhood in a form of graph Laplacian regularization. Specifically, our approach consists of two sub-modules; a class activation module to estimate the action score map over time through the action classifiers, and a graph regularization module to refine the estimated action score map by solving a quadratic programming problem with the predicted segment-level semantic affinities. Since these two modules are integrated with fully differentiable layers, the proposed networks can be jointly trained in an end-to-end manner. Experimental results on Thumos14 and ActivityNet1.2 demonstrate that the proposed method provides outstanding performances in weakly-supervised temporal action localization.

[1]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[2]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[3]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Amit K. Roy-Chowdhury,et al.  W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.

[5]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[6]  Bingbing Ni,et al.  Temporal Action Localization with Pyramid of Score Distribution Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  J. Zico Kolter,et al.  OptNet: Differentiable Optimization as a Layer in Neural Networks , 2017, ICML.

[8]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[9]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, International Journal of Computer Vision.

[10]  Yong Jae Lee,et al.  Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Yale Song,et al.  Video co-summarization: Video summarization by visual co-occurrence , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Lei Zhang,et al.  AutoLoc: Weakly-supervised Temporal Action Localization , 2018, ECCV.

[14]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[15]  Bohyung Han,et al.  Weakly Supervised Action Localization by Sparse Temporal Pooling Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Gene Cheung,et al.  Graph Laplacian Regularization for Image Denoising: Analysis in the Continuous Domain , 2016, IEEE Transactions on Image Processing.

[18]  Daniel Cremers,et al.  An Improved Algorithm for TV-L 1 Optical Flow , 2009, Statistical and Geometrical Approaches to Visual Motion Analysis.

[19]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Tong Lu,et al.  Temporal Action Localization by Structured Maximal Sums , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24]  Ramakant Nevatia,et al.  Cascaded Boundary Regression for Temporal Action Detection , 2017, BMVC.

[25]  Bernard Ghanem,et al.  Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).