Group-Skeleton-Based Human Action Recognition in Complex Events

Human action recognition as an important application of computer vision has been studied for decades. Among various approaches, skeleton-based methods recently attract increasing attention due to their robust and superior performance. However, existing skeleton-based methods ignore the potential action relationships between different persons, while the action of a person is highly likely to be impacted by another person especially in complex events. In this paper, we propose a novel group-skeleton-based human action recognition method in complex events. This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons. In addition to the traditional key point coordinates, we also input the key point speed values to the networks for better performance. Then we use multilayer perceptrons (MLPs) to embed the distance values between the reference person and other persons into the extracted features. Lastly, all the features are fed into another MS-G3D for feature fusion and classification. For avoiding class imbalance problems, the networks are trained with a focal loss. The proposed algorithm is also our solution for the Large-scale Human-centric Video Analysis in Complex Events Challenge. Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.

[1]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[2]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[3]  Lei Shi,et al.  Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Haoyu Wang,et al.  Pose Flow: Efficient Online Pose Tracking , 2018, BMVC.

[7]  Somayeh Danafar,et al.  Action Recognition for Surveillance Applications Using Optic Flow and SVM , 2007, ACCV.

[8]  Guanghan Ning,et al.  LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[9]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[11]  Nicu Sebe,et al.  Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events , 2020, ArXiv.

[12]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Zhenghao Chen,et al.  Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Xiaopeng Hong,et al.  Learning Graph Convolutional Network for Skeleton-based Human Action Recognition by Neural Searching , 2019, AAAI.

[15]  Cewu Lu,et al.  PaStaNet: Toward Human Activity Knowledge Engine , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Lei Shi,et al.  Skeleton-Based Action Recognition With Directed Graph Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Hao Zhu,et al.  CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[20]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Pichao Wang,et al.  Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Lei Zhang,et al.  Spatio-temporal SIFT and Its Application to Human Action Classification , 2012, ECCV Workshops.

[25]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Wu Liu,et al.  T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition , 2018, AAAI.

[27]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[28]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[30]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[31]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[34]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.