Self-Attention Pooling-Based Long-Term Temporal Network for Action Recognition

With the development of the Internet of Things (IoT), self-driving technology has made considerable progress. Yet safe driving still faces challenges in cases such as pedestrians crossing roads, so sensing pedestrians' movements and identifying their behaviors from video data is important. Most existing methods fail to 1) capture long-term temporal relationships well, owing to their limited temporal coverage, and 2) aggregate discriminative representations effectively, because they pay little or no attention to the differences among representations. To address these issues, this work presents a new architecture called a self-attention pooling-based long-term temporal network (SP-LTN), which learns long-term temporal representations and aggregates the discriminative ones in an end-to-end manner. First, it conducts long-term representation learning on a given video by capturing spatial information and mining temporal patterns. Next, it develops a self-attention pooling method that predicts importance scores for the obtained representations to distinguish them from each other, and then combines the representations by weighted aggregation to highlight the contributions of the discriminative ones to action recognition. Finally, it designs a new loss function that combines a standard cross-entropy loss with a regularization term to focus further on the discriminative representations while restraining the impact of distracting ones on activity classification. Experimental results on two data sets show that SP-LTN, fed only red–green–blue (RGB) frames, outperforms state-of-the-art methods.
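The self-attention pooling step and the combined loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract specifies neither the scoring function nor the form of the regularization term, so the following are assumptions made only for illustration: a single linear scoring vector `w`, softmax normalization of the scores, an entropy penalty on the attention weights as the regularizer, and the trade-off weight `lam`.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, w):
    # features: T segment-level representations, each a list of D floats.
    # w: hypothetical learned scoring vector of length D (an assumption).
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in features]
    weights = softmax(scores)  # importance score per representation
    dim = len(features[0])
    # Weighted sum of the representations, one value per feature dimension.
    pooled = [sum(a * f[d] for a, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

def cross_entropy(logits, label):
    # Standard cross-entropy for a single example with an integer class label.
    return -math.log(softmax(logits)[label])

def entropy(weights, eps=1e-12):
    # Entropy of the attention weights; lower means more concentrated attention.
    return -sum(a * math.log(a + eps) for a in weights)

def total_loss(logits, label, weights, lam=0.1):
    # Cross-entropy plus an assumed entropy regularizer (not taken from the paper).
    return cross_entropy(logits, label) + lam * entropy(weights)

# Three segment representations of dimension 2; the third scores highest under w.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
w = [1.0, 0.0]
pooled, weights = attention_pool(features, w)
loss = total_loss([2.0, 0.5], 0, weights)
```

Under these assumptions, penalizing the entropy of the weights pushes the pooling toward a few high-scoring representations, which is one possible way to restrain distracting ones. The paper's actual regularization term may differ.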
