TraMNet - Transition Matrix Network for Efficient Action Tube Proposals

Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors into 3D-cuboid proposals over stacks of frames, generating sets of temporally connected bounding boxes called action micro-tubes. However, they fail to account for the fact that the underlying anchor hypotheses should also move (transition) from frame to frame as the actor or the camera moves. Assuming n 2D anchors are evaluated in each frame, the number of possible anchor-to-anchor transitions over a sequence of f consecutive frames is \(O(n^f)\), which is prohibitively expensive even for small values of f.
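To make the scale of this combinatorial argument concrete, the following minimal Python sketch (illustrative only, not taken from the paper) contrasts exhaustive enumeration of anchor sequences with a pairwise formulation in which transitions are modelled only between consecutive frames, as a transition-matrix approach would. The function names and the example values of n and f are hypothetical.

    def exhaustive_hypotheses(n: int, f: int) -> int:
        """Distinct anchor-to-anchor paths across f frames: O(n^f)."""
        return n ** f

    def pairwise_transition_entries(n: int, f: int) -> int:
        """Entries needed if transitions are modelled only between
        consecutive frame pairs: (f - 1) matrices of size n x n."""
        return (f - 1) * n * n

    if __name__ == "__main__":
        # Illustrative values; real single-shot detectors evaluate
        # thousands of anchors per frame.
        n, f = 100, 5
        print(exhaustive_hypotheses(n, f))        # 10000000000
        print(pairwise_transition_entries(n, f))  # 40000

Even in this toy setting, the exhaustive count exceeds the pairwise one by roughly five orders of magnitude, which is the gap that modelling transitions only between neighbouring frames exploits.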
