论文信息 - Detecting Temporal Proposal for Action Localization with Tree-structured Search Policy

Detecting Temporal Proposal for Action Localization with Tree-structured Search Policy

Understanding the semantics in videos is a complex but crucial task in video analysis. This paper focuses on localizing category-independent events, actions or other semantics in an untrimmed video, referred as salient temporal proposal localization. Traditional methods like sliding window have a high computational cost due to the densely sampling of different video segments. We propose a reinforcement learning based method, which trains a localizer that learns a search policy that, instead of exploring every video segment, finds an optimal search path to locate a salient proposal based on the currently observing video segment in a tree structure, therefore reduces the number of video segments fed into the proposal detector. In each search step, a localizer is trained to iteratively select the next sub-region containing salient proposals to continue the search, and a proposal detector is trained to recognize salient proposal from the sub-regions. The experiments demonstrate that our method is able to precisely detect salient proposals with a comparable recall and with much fewer candidate windows.

[1] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.

[3] Li Fei-Fei,et al. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[4] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[5] Cordelia Schmid,et al. Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[6] Cristian Sminchisescu,et al. Reinforcement Learning for Visual Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8] Shih-Fu Chang,et al. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[11] Philip H. S. Torr,et al. BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[12] Patrick Bouthemy,et al. Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Koen E. A. van de Sande,et al. Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[15] Jitendra Malik,et al. Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Miriam Bellver,et al. Hierarchical Object Detection with Deep Reinforcement Learning , 2016, NIPS 2016.

[17] Li Fei-Fei,et al. End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Ramakant Nevatia,et al. Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images , 2015, ACM Multimedia.

[19] Ming-Hsuan Yang,et al. Top-down visual saliency via joint CRF and dictionary learning , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20] Shuicheng Yan,et al. Tree-Structured Reinforcement Learning for Sequential Object Localization , 2016, NIPS.

[21] Yi Yang,et al. A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Yi Li,et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[23] Svetlana Lazebnik,et al. Active Object Localization with Deep Reinforcement Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).