A Novel Two-stream Architecture Fusing Static And Dynamic Features for Human Action Recognition

Action recognition in real-world videos is difficult because of factors such as background clutter, scale variation, viewpoint changes, and abrupt motion. This paper proposes a novel two-stream architecture that fuses static and dynamic features to recognize human actions in videos. Firstly, a Convolutional Neural Network (CNN) extracts feature maps from the input images (single frames and optical flow fields). Secondly, we apply a 3 × 3 convolution over all neighboring features in these feature maps, which yields a static representation of the features. We then concatenate the static features with the input feature maps and pass the result through two 1 × 1 convolutions to produce a dynamic attention matrix. The feature maps are aggregated using the learnt attention matrix, producing a dynamic representation. Thirdly, we fuse the static and dynamic representations to form the final output. Finally, we use a Long Short-Term Memory (LSTM) network to capture temporal information across the dense optical flow sequence. Experimental results on three challenging datasets, UCF101, HMDB51, and Kinetics400, show that the method outperforms other state-of-the-art methods.
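The static and dynamic feature steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function and parameter names (`conv3x3`, `conv1x1`, `static_dynamic_block`, `init_params`, `w_key`, `w_att1`, `w_att2`, `w_val`) are hypothetical. The ReLU between the two 1 × 1 convolutions, the softmax over spatial positions, the 1 × 1 value projection, and fusion by summation are all assumptions made for illustration. The CNN backbone and the LSTM stage are omitted.

```python
import numpy as np


def conv3x3(x, w):
    """Same-padded 3x3 convolution. x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def conv1x1(x, w):
    """1x1 convolution. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def init_params(channels, rng, scale=0.1):
    """Random weights for the sketch (hypothetical parameter names)."""
    c = channels
    return {
        "w_key": scale * rng.standard_normal((c, c, 3, 3)),
        "w_att1": scale * rng.standard_normal((c, 2 * c)),
        "w_att2": scale * rng.standard_normal((c, c)),
        "w_val": scale * rng.standard_normal((c, c)),
    }


def static_dynamic_block(x, params):
    """Return (fused features, attention) for a feature map x of shape (C, H, W)."""
    # Static representation: 3x3 convolution over neighboring features.
    static = conv3x3(x, params["w_key"])
    # Concatenate static features with the input feature map along channels.
    joint = np.concatenate([static, x], axis=0)  # (2C, H, W)
    # Two 1x1 convolutions produce attention logits (ReLU in between is assumed).
    hidden = np.maximum(conv1x1(joint, params["w_att1"]), 0.0)
    logits = conv1x1(hidden, params["w_att2"])  # (C, H, W)
    # Assumed normalization: softmax over spatial positions, per channel.
    c, h, w = logits.shape
    flat = logits.reshape(c, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)
    attn = np.exp(flat)
    attn /= attn.sum(axis=1, keepdims=True)
    attn = attn.reshape(c, h, w)
    # Dynamic representation: attention-weighted value features.
    values = conv1x1(x, params["w_val"])
    dynamic = attn * values
    # Assumed fusion: element-wise sum of static and dynamic representations.
    return static + dynamic, attn


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 6))  # toy feature map, C=4, H=5, W=6
params = init_params(4, rng)
fused, attention = static_dynamic_block(features, params)
print(fused.shape)  # (4, 5, 6)
```

In the full pipeline, a block of this kind would be applied to the CNN feature maps of each stream, and the resulting per-frame features would be passed on to the LSTM.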
