TEINet: Towards an Efficient Architecture for Video Recognition

Efficiency is a central concern in designing video architectures for action recognition. 3D CNNs have achieved remarkable progress in action recognition from videos, but compared with their 2D counterparts, 3D convolutions introduce a large number of parameters and incur high computational cost. To alleviate this problem, we propose an efficient temporal module, termed the Temporal Enhancement-and-Interaction (TEI) module, which can be plugged into existing 2D CNNs (the resulting network is denoted TEINet). The TEI module presents a different paradigm for learning temporal features by decoupling the modeling of channel correlation and temporal interaction. First, it contains a Motion Enhanced Module (MEM), which enhances motion-related features while suppressing irrelevant information (e.g., background). Second, it introduces a Temporal Interaction Module (TIM), which supplements temporal contextual information in a channel-wise manner. This two-stage modeling scheme captures temporal structure flexibly and effectively while remaining efficient at inference time. We conduct extensive experiments to verify the effectiveness of TEINet on several benchmarks (Something-Something V1&V2, Kinetics, UCF101, and HMDB51). TEINet achieves strong recognition accuracy on these datasets while preserving high efficiency.
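The following is a minimal PyTorch sketch of the two-stage scheme described above. The module names MEM and TIM follow the abstract, but the concrete design choices here (a squeeze-and-excitation-style channel reduction, zero-padding the last frame difference, a depthwise temporal kernel of size 3, eight segments per clip) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MEM(nn.Module):
    """Motion Enhanced Module (sketch): channel attention driven by
    differences of adjacent-frame pooled features, so motion-related
    channels are enhanced and static background is suppressed."""

    def __init__(self, channels, n_segment=8, reduction=8):
        super().__init__()
        self.n_segment = n_segment
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.expand = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N*T, C, H, W) -- frames of each clip stacked along the batch axis
        nt, c, h, w = x.shape
        n, t = nt // self.n_segment, self.n_segment
        s = self.squeeze(self.pool(x)).reshape(n, t, -1, 1, 1)
        # adjacent-frame difference as a cheap motion cue; zero-pad the
        # last step so the temporal length stays T (our assumption)
        diff = s[:, 1:] - s[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, -1:])], dim=1)
        attn = self.sigmoid(self.expand(diff.reshape(nt, -1, 1, 1)))
        return x * attn


class TIM(nn.Module):
    """Temporal Interaction Module (sketch): a depthwise (channel-wise)
    1D convolution along time gives every channel context from its
    temporal neighbours without mixing channels."""

    def __init__(self, channels, n_segment=8, kernel_size=3):
        super().__init__()
        self.n_segment = n_segment
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):
        nt, c, h, w = x.shape
        n, t = nt // self.n_segment, self.n_segment
        y = x.reshape(n, t, c, h, w).permute(0, 3, 4, 2, 1)  # (N, H, W, C, T)
        y = self.conv(y.reshape(n * h * w, c, t))            # conv over T per channel
        y = y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)  # back to (N, T, C, H, W)
        return y.reshape(nt, c, h, w)


# Usage: insert MEM followed by TIM in front of a residual block of a 2D ResNet.
tei = nn.Sequential(MEM(64, n_segment=8), TIM(64, n_segment=8))
clip = torch.randn(2 * 8, 64, 56, 56)  # 2 clips x 8 frames, stacked on batch axis
out = tei(clip)                        # shape preserved: (16, 64, 56, 56)
```

Because both submodules preserve the input shape, such a module can be dropped in front of any 2D residual block without changing the backbone, which is what makes the approach cheap relative to full 3D convolution.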
