论文信息 - Video Action Recognition Based on Spatio-temporal Feature Pyramid Module

Video Action Recognition Based on Spatio-temporal Feature Pyramid Module

Modeling the spatio-temporal information of different actions facilitates their recognition. The mainstream 2D convolutional network has low computational cost but cannot capture timing information; the mainstream 3D convolutional network can extract spatio-temporal features but has a huge amount of calculation and is difficult to deploy. In this paper, a Spatiotemporal Feature Pyramid Module(STFPM) is proposed to extract spatio-temporal feature information. STFPM captures temporal information between frames by dilated convolution and fuses feature information by weighted addition. STFPM can be flexibly inserted into the 2D backbone network in a plug-and-play manner. When equipped with STFPM, 2D ResNet-50 achieves good results on UCF101 dataset and HMDB51 dataset.

Ying Chen | Suming Gong

[1] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Luc Van Gool,et al. Deep Temporal Linear Encoding Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[4] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[5] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Philip S. Yu,et al. Spatiotemporal Pyramid Network for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Vladlen Koltun,et al. Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[8] Xu Li,et al. STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition , 2020, ArXiv.

[9] Xiaoyan Sun,et al. MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[11] Cordelia Schmid,et al. Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Cewu Lu,et al. Approximated Bilinear Modules for Temporal Modeling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[16] Xiao Liu,et al. StNet: Local and Global Spatial-Temporal Modeling for Action Recognition , 2018, AAAI.

[17] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).