论文信息 - Multi-region Two-Stream R-CNN for Action Detection

Multi-region Two-Stream R-CNN for Action Detection

We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on faster R-CNN [1], and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals , which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art with a significant margin in both frame-mAP and video-mAP.

Cordelia Schmid | Xiaojiang Peng | C. Schmid | Xiaojiang Peng

[1] Cordelia Schmid,et al. Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Cordelia Schmid,et al. P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3] Limin Wang,et al. Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Ying Wu,et al. Discriminative subvolume search for efficient action detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6] Limin Wang,et al. Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[7] Guigang Zhang,et al. Deep Learning , 2016, Int. J. Semantic Comput..

[8] Yu Qiao,et al. Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[9] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Patrick Pérez,et al. Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11] Mubarak Shah,et al. Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[12] Van Nostrand,et al. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[13] Thomas Mensink,et al. Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[14] Koen E. A. van de Sande,et al. Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[15] Larry S. Davis,et al. Representing Videos Using Mid-level Discriminative Patches , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Andrew Zisserman,et al. Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[17] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[18] Lin Sun,et al. DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19] Jitendra Malik,et al. Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20] Jon Bentley,et al. Programming pearls: algorithm design techniques , 1984, CACM.

[21] David A. Forsyth,et al. Video Event Detection: From Subvolume Localization to Spatiotemporal Path Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22] Jitendra Malik,et al. Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[24] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[25] Cordelia Schmid,et al. Efficient Action Localization with Approximately Normalized Fisher Vectors , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] Jitendra Malik,et al. Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Richard P. Wildes,et al. Efficient action spotting based on a spacetime oriented structure representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29] Nikos Komodakis,et al. Object Detection via a Multi-region and Semantic Segmentation-Aware CNN Model , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30] Thomas Brox,et al. High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[31] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[32] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[33] Sergio Escalera,et al. ChaLearn Looking at People Challenge 2014: Dataset and Results , 2014, ECCV Workshops.

[34] Jian Sun,et al. Convolutional feature masking for joint object and stuff segmentation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Mubarak Shah,et al. Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36] C. Lawrence Zitnick,et al. Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[37] David A. McAllester,et al. A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[38] J.K. Aggarwal,et al. Human activity analysis , 2011, ACM Comput. Surv..

[39] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[40] Gang Yu,et al. Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42] Cordelia Schmid,et al. Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[43] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.