ASNet: Auto-Augmented Siamese Neural Network for Action Recognition

Video-based human action recognition methods built on deep convolutional neural networks usually rely on random cropping or its variants for data augmentation. However, this traditional approach may generate many non-informative samples (video patches covering only a small part of the foreground, or only the background) that are unrelated to the specific action. Such samples can be regarded as noisy samples with incorrect labels, and they reduce overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, salient patches and randomly cropped samples are backpropagated in the same iteration, so that gradient compensation alleviates the adverse gradient effects of non-informative samples. Salient patches are samples that contain critical information for human action recognition. Their generation is formulated as a Markov decision process, and a reinforcement learning agent called SPA (Salient Patch Agent) is introduced to extract the patches in a weakly supervised manner, without extra labels. Extensive experiments were conducted on two well-known datasets, UCF-101 and HMDB-51, to verify the effectiveness of the proposed SPA and ASNet.
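The idea of backpropagating both kinds of sample in one iteration can be illustrated with a toy sketch. It is not the ASNet implementation: the linear logistic model, the equal weighting of the two gradients, the helper names (`crop`, `grad_logistic`, `siamese_step`), and the hand-specified salient box (standing in for the SPA agent) are all assumptions made for illustration.

```python
import math

def crop(frame, top, left, size):
    """Return a size x size patch from a 2D frame given as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]

def grad_logistic(w, x, y):
    """Gradient of binary cross-entropy w.r.t. w for a linear logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def siamese_step(w, frame, label, rand_box, salient_box, size, lr=0.1):
    """One update in which both branches share the weights w.

    rand_box and salient_box are (top, left) offsets. The salient box is
    supplied by the caller here (assumption); in the paper it would come
    from the SPA agent. The two gradients are simply summed (assumption).
    """
    x_rand = [v for row in crop(frame, rand_box[0], rand_box[1], size) for v in row]
    x_sal = [v for row in crop(frame, salient_box[0], salient_box[1], size) for v in row]
    g_rand = grad_logistic(w, x_rand, label)
    g_sal = grad_logistic(w, x_sal, label)
    return [wi - lr * (gr + gs) for wi, gr, gs in zip(w, g_rand, g_sal)]

# Toy frame: the "action" occupies the top-left 2x2 corner, the rest is background.
frame = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The random crop at (2, 2) contains only background, so it contributes no
# useful gradient; the salient patch at (0, 0) still drives the update.
w_new = siamese_step([0.0] * 4, frame, label=1, rand_box=(2, 2),
                     salient_box=(0, 0), size=2)
```

In this toy setting the background-only crop yields a zero gradient, so the update comes entirely from the salient branch. This shows only the mechanics of pairing the two samples in one step and makes no claim about how ASNet weights or combines them.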
