Spatio-temporal deformable 3D ConvNets with attention for action recognition

Abstract The irregularity of human actions poses great challenges in video action recognition. Recently, 3D ConvNet methods have shown promising performance at modelling the motion and appearance information. However, the fixed geometric structure of 3D convolution filters largely limits the learning capacity for video action recognition. To address this problem, this paper proposes a spatio-temporal deformable ConvNet module with an attention mechanism, which takes into consideration the mutual correlations in both temporal and spatial domains, to effectively capture the long-range and long-distance dependencies in the video actions. Our attention based deformable module, as a generic module for 3D ConvNets, can adaptively learn more accurate spatio-temporal offsets to model the action irregularity. The experiments on two popular datasets (UCF-101 and HMDB-51) demonstrate that our module significantly outperforms the state-of-the-art methods.

[1]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[2]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  Yang Du,et al.  Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Lingfeng Wang,et al.  Joint spatial-temporal attention for action recognition , 2018, Pattern Recognit. Lett..

[5]  Ali Borji,et al.  Revisiting Video Saliency: A Large-Scale Benchmark and a New Model , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[10]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Sridha Sridharan,et al.  Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[12]  Baoxin Li,et al.  Multi-stream CNN: Learning representations based on human-related regions for action recognition , 2018, Pattern Recognit..

[13]  Shijian Lu,et al.  TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  Stan Sclaroff,et al.  Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web , 2015, Pattern Recognit..

[15]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[16]  Junmo Kim,et al.  Active Convolution: Learning the Shape of Convolution for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Li Xie,et al.  Fully convolutional networks for action recognition , 2017, IET Comput. Vis..

[18]  Ling-Yu Duan,et al.  Unified Spatio-Temporal Attention Networks for Action Recognition in Videos , 2019, IEEE Transactions on Multimedia.

[19]  Ruigang Yang,et al.  Saliency-Aware Video Object Segmentation , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[21]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[22]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Chuang Gan,et al.  Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering , 2019, AAAI.

[24]  Kai Liu,et al.  Tensor-based linear dynamical systems for action recognition from 3D skeletons , 2018, Pattern Recognit..

[25]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ling Shao,et al.  Compressive Sequential Learning for Action Similarity Labeling , 2016, IEEE Transactions on Image Processing.

[28]  Xiaoshuai Sun,et al.  Two-Stream 3-D convNet Fusion for Action Recognition in Videos With Arbitrary Size and Length , 2018, IEEE Transactions on Multimedia.

[29]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ruigang Yang,et al.  Semi-Supervised Video Object Segmentation with Super-Trajectories , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[34]  Stephen Lin,et al.  Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[36]  Luc Van Gool,et al.  Spatio-Temporal Channel Correlation Networks for Action Classification , 2018, ECCV.

[37]  Jiwen Lu,et al.  Learning principal orientations and residual descriptor for action recognition , 2019, Pattern Recognit..

[38]  Xiaoyan Sun,et al.  MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Sergio Escalera,et al.  Residual Stacked RNNs for Action Recognition , 2018, ECCV Workshops.

[40]  Tiejun Huang,et al.  Temporal Attentive Network for Action Recognition , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[41]  Jean-Michel Morel,et al.  A non-local algorithm for image denoising , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[42]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[43]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[44]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Ling Shao,et al.  Video Salient Object Detection via Fully Convolutional Networks , 2017, IEEE Transactions on Image Processing.

[46]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[47]  Meng Wang,et al.  Self-Supervised Video Hashing With Hierarchical Binary Auto-Encoder , 2018, IEEE Transactions on Image Processing.

[48]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[49]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Xuelong Li,et al.  From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[52]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Bingbing Ni,et al.  Binary Coding for Partial Action Analysis with Limited Observation Ratios , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Hao Yang,et al.  Time-Asymmetric 3d Convolutional Neural Networks for Action Recognition , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[55]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[56]  Yadong Mu,et al.  Boosting Temporal Binary Coding for Large-Scale Video Search , 2017, IEEE Transactions on Multimedia.

[57]  Chao Li,et al.  Active Transfer Learning Network: A Unified Deep Joint Spectral–Spatial Feature Learning Model for Hyperspectral Image Classification , 2019, IEEE Transactions on Geoscience and Remote Sensing.

[58]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[59]  Bingbing Ni,et al.  Zero-Shot Action Recognition with Error-Correcting Output Codes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Dahua Lin,et al.  Trajectory Convolution for Action Recognition , 2018, NeurIPS.