TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection

Current state-of-the-art approaches to spatio-temporal action detection achieve impressive results but remain unsatisfactory at detecting the temporal extent of actions. The main reason is that some ambiguous states resemble real actions closely enough to be treated as target actions even by a well-trained network. In this paper, we define these ambiguous samples as "transitional states" and propose a Transition-Aware Context Network (TACNet) to distinguish them. The proposed TACNet includes two main components: a temporal context detector and a transition-aware classifier. The temporal context detector extracts long-term context information with constant time complexity by constructing a recurrent network. The transition-aware classifier further distinguishes transitional states by classifying actions and transitional states simultaneously. TACNet can therefore substantially improve the performance of spatio-temporal action detection. We extensively evaluate TACNet on the UCF101-24 and J-HMDB datasets. The experimental results demonstrate that TACNet achieves competitive performance on J-HMDB and significantly outperforms state-of-the-art methods on the untrimmed UCF101-24 dataset in terms of both frame-mAP and video-mAP.
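The abstract describes two components: a recurrent module that carries long-term context forward at constant per-frame cost, and a classifier that jointly predicts the action class and a transitional-state flag. The following is a minimal PyTorch sketch of that structure, not the authors' implementation; all names (TemporalContextDetector, TransitionAwareClassifier) and dimensions (feature_dim, hidden_dim, num_classes) are illustrative assumptions.

```python
# Hedged sketch of the two TACNet components as described in the abstract.
import torch
import torch.nn as nn

class TemporalContextDetector(nn.Module):
    """Accumulates long-term temporal context with a recurrent update,
    so processing each new frame costs O(1) regardless of video length."""
    def __init__(self, feature_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn_cell = nn.LSTMCell(feature_dim, hidden_dim)

    def forward(self, frame_feat, state=None):
        # frame_feat: (batch, feature_dim) features for the current frame.
        h, c = self.rnn_cell(frame_feat, state)
        return h, (h, c)

class TransitionAwareClassifier(nn.Module):
    """Jointly predicts the action class and whether the current frame is
    a 'transitional state' (action-like but not the target action)."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.action_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for background
        self.transition_head = nn.Linear(hidden_dim, 2)  # transitional vs. not

    def forward(self, context_feat):
        return self.action_head(context_feat), self.transition_head(context_feat)

# Usage on a stream of per-frame features (8 frames, batch of 4, 512-dim):
detector = TemporalContextDetector(feature_dim=512, hidden_dim=256)
classifier = TransitionAwareClassifier(hidden_dim=256, num_classes=24)
state = None
for frame_feat in torch.randn(8, 4, 512):
    context, state = detector(frame_feat, state)
    action_logits, transition_logits = classifier(context)
```

The recurrent cell is what gives the constant time complexity claimed in the abstract: the running state summarizes all past frames, so each step touches only the current frame. The paper's actual detection backbone and how the two heads are trained are not specified in this abstract.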
