SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection

Temporal action detection (TAD) is a challenging task that aims to temporally localize and recognize human actions in untrimmed videos. Current mainstream one-stage TAD approaches localize and classify action proposals by relying on pre-defined anchors, whose locations and scales are fixed by the designer. Such anchor-based methods limit generalization capability and degrade in performance when videos contain rich action variation. In this study, we explore removing the requirement of pre-defined anchors from TAD. We develop a novel TAD model, termed the Selective Receptive Field Network (SRF-Net), which directly estimates location offsets and classification scores at each temporal location of the feature map and is trained in an end-to-end manner. At its core is a dedicated building block, Selective Receptive Field Convolution (SRFC), which adaptively adjusts its receptive field size according to multiple scales of input information at each temporal location in the feature map. Extensive experiments on the THUMOS14 dataset show superior results compared with state-of-the-art TAD approaches.
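The abstract does not give SRFC's internals, but the description (multiple scales of input, a receptive field selected adaptively per temporal location) suggests a mechanism in the spirit of Selective Kernel Networks adapted to 1-D temporal features. The sketch below is an illustrative NumPy assumption, not the paper's implementation: two convolution branches with different kernel sizes, a squeeze step over time, and a learned softmax attention that mixes the branches per channel. All weight names and the choice of two branches (k=3 and k=5) are hypothetical.

```python
import numpy as np

def conv1d_same(x, w):
    """1-D convolution with 'same' padding.
    x: (C_in, T), w: (C_out, C_in, k) -> (C_out, T)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # contract over input channels and kernel taps
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def srfc(x, w_small, w_large, w_fc1, w_fc2):
    """Hypothetical selective receptive-field conv over a temporal
    feature map x: (C, T). Weight shapes:
      w_small: (C, C, 3), w_large: (C, C, 5),
      w_fc1: (d, C), w_fc2: (2*C, d)."""
    # 1) branches with different temporal receptive fields
    u_small = conv1d_same(x, w_small)
    u_large = conv1d_same(x, w_large)
    # 2) fuse branches and squeeze over time into a channel descriptor
    s = (u_small + u_large).mean(axis=1)          # (C,)
    z = np.maximum(w_fc1 @ s, 0.0)                # (d,) bottleneck + ReLU
    # 3) per-channel attention over the two branches (sums to 1)
    logits = (w_fc2 @ z).reshape(2, -1)           # (2, C)
    a = softmax(logits, axis=0)
    # 4) selective aggregation: soft choice of receptive field per channel
    return a[0][:, None] * u_small + a[1][:, None] * u_large
```

In this reading, the softmax weights let each channel softly choose between a short and a long temporal receptive field, which matches the abstract's claim of scale-adaptive behavior at each temporal location.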
