Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning

Abstract: Live video hosted by streamers is increasingly popular among Internet users. A few streamers insert inappropriate actions into otherwise normal live video content for profit and popularity, which harms the network environment. To regulate streamer behavior in live video effectively, this paper proposes a streamer action recognition method based on spatial-temporal attention and deep dictionary learning. First, after video frames are sampled from the live video, a spatial attention network extracts deep features with spatial context, focusing on the streamer's action region. Then, a temporal attention network learns the frame attention for an action and fuses the deep features of the video by assigning a weight to each frame. Finally, deep dictionary learning is used to sparsely represent the deep features and recognize streamer actions. Four experiments are conducted on a real-world dataset, and the competitive results demonstrate that our method can improve the accuracy and speed of streamer action recognition in live video.
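
The temporal fusion step described above (assigning a weight to each frame and combining the frame features) can be illustrated with a minimal sketch. This is not the paper's network: the abstract does not specify how frame scores are computed, so the linear scoring vector, the softmax normalization, and the names `softmax`, `temporal_attention_fuse`, and `scoring_vector` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw frame scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fuse(frame_features, scoring_vector):
    """Fuse per-frame feature vectors into one video-level vector.

    frame_features: list of equal-length feature vectors, one per sampled frame.
    scoring_vector: hypothetical learned parameters (assumption); the score of
        a frame is its dot product with this vector.
    Returns (weights, fused), where fused is the weighted sum of the frames.
    """
    scores = [
        sum(f * w for f, w in zip(feature, scoring_vector))
        for feature in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    fused = [
        sum(weight * feature[d] for weight, feature in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, fused


# Toy usage: three frames with 2-dimensional features (made-up numbers).
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights, fused = temporal_attention_fuse(frames, scoring_vector=[1.0, 1.0])
```

In this toy run the third frame has the largest score and therefore the largest weight, so it dominates the fused feature. In a trained model the scoring parameters would be learned jointly with the rest of the network rather than fixed by hand.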
