Hi-EADN: Hierarchical Excitation Aggregation and Disentanglement Frameworks for Action Recognition Based on Videos

Most existing video action recognition methods rely mainly on high-level semantic information from convolutional neural networks (CNNs) but ignore the discrepancies between different information streams; in particular, they rarely model long-range aggregation and short-range motion together. To solve these problems, we propose hierarchical excitation aggregation and disentanglement networks (Hi-EADNs), which comprise a multiple frame excitation aggregation (MFEA) module and a feature squeeze-and-excitation hierarchical disentanglement (SEHD) module. MFEA performs long- and short-range motion modelling and computes feature-level temporal differences. The SEHD module uses these differences to optimize the weight of each spatiotemporal feature and to excite motion-sensitive channels. Moreover, without introducing additional parameters, the features are processed by a series of squeeze-and-excitation operations, and multiple temporal aggregations over neighbourhoods enhance the interaction between different motion frames. Extensive experiments confirm the effectiveness of the proposed Hi-EADN on the UCF101 and HMDB51 benchmark datasets, where it achieves top-5 accuracies of 93.5% and 76.96%, respectively.
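To make the excitation pipeline above concrete, here is a minimal PyTorch-style sketch under assumptions of our own: it computes feature-level temporal differences between neighbouring frames, squeezes them spatially, and uses a squeeze-and-excitation bottleneck to reweight motion-sensitive channels. The class name MotionExcitationSketch, the reduction parameter, and the residual reweighting are illustrative choices, not the authors' implementation.

```python
# Hedged sketch: motion excitation via feature-level temporal differences
# followed by squeeze-and-excitation channel reweighting.
import torch
import torch.nn as nn

class MotionExcitationSketch(nn.Module):
    """Illustrative module (not the paper's exact MFEA/SEHD design).

    Input:  x of shape (N, T, C, H, W) -- per-frame CNN features.
    Output: same shape, with motion-sensitive channels amplified.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze-and-excitation bottleneck over the channel dimension.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        # Feature-level temporal difference between neighbouring frames;
        # zero-pad the last step so the sequence length is preserved.
        diff = x[:, 1:] - x[:, :-1]                      # (N, T-1, C, H, W)
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)
        # Squeeze: global average pooling over the spatial dimensions.
        squeezed = diff.mean(dim=(3, 4))                 # (N, T, C)
        # Excitation: per-frame channel weights from the motion signal.
        weights = self.fc(squeezed).view(n, t, c, 1, 1)
        # Excite motion-sensitive channels (residual-style reweighting).
        return x + x * weights

# Usage on dummy features: 2 clips, 8 frames, 64 channels, 14x14 maps.
feats = torch.randn(2, 8, 64, 14, 14)
out = MotionExcitationSketch(channels=64)(feats)
print(out.shape)  # torch.Size([2, 8, 64, 14, 14])
```

Because the reweighting is residual and the bottleneck adds only two small linear layers, this style of excitation keeps the parameter overhead negligible, which matches the paper's stated goal of avoiding additional parameters.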
