Multimodal activity recognition with local block CNN and attention-based spatial weighted CNN

Abstract Deep-learning-based human activity recognition approaches combine spatial and temporal information to complete the recognition task. The temporal information is extracted with optical flow, which is typically compensated by a warping method to improve performance. However, these methods usually start from global features: they consider only the global information of video frames and ignore the local information that reflects changes in human behavior, which makes the algorithm sensitive to external conditions such as occlusion and illumination change. To address these problems, this paper fuses the local spatial features, global spatial features, and temporal features of video frames to recognize different actions, and further extracts visual attention weights to constrain the global spatial features. Experiments show that the proposed algorithm achieves better accuracy than existing methods.
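The two ideas named in the abstract, attention weights that constrain the global spatial features and fusion of several feature streams, can be illustrated with a minimal sketch. The abstract does not specify how the weights are computed or how the streams are fused, so everything below is an assumption: the softmax attention over spatial locations, the weighted-average late fusion, and the names `attention_pool` and `fuse_scores` are hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(feature_map, attention_scores):
    """Pool per-location features with attention weights (assumed scheme).

    feature_map: list of feature vectors, one per spatial location.
    attention_scores: one raw (unnormalized) score per location.
    Returns the weighted feature vector and the normalized weights.
    """
    weights = softmax(attention_scores)
    dim = len(feature_map[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, feature_map):
        for i in range(dim):
            pooled[i] += w * feat[i]
    return pooled, weights


def fuse_scores(stream_scores, stream_weights):
    """Weighted average of per-class scores from several streams (assumed late fusion)."""
    total_w = sum(stream_weights)
    n_classes = len(stream_scores[0])
    fused = [0.0] * n_classes
    for scores, w in zip(stream_scores, stream_weights):
        for c in range(n_classes):
            fused[c] += w * scores[c] / total_w
    return fused


# Two spatial locations with equal attention scores get equal weights.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# Hypothetical class scores from local spatial, global spatial, and temporal streams.
fused = fuse_scores([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]], [1.0, 1.0, 2.0])
predicted_class = max(range(len(fused)), key=lambda c: fused[c])
```

In this toy setting a location with a higher attention score contributes more to the pooled global feature, and the fusion weights control how much each stream influences the final class decision. Both sets of weights would be learned or tuned in a real system.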
