Visual Action Recognition Using Deep Learning in Video Surveillance Systems

Skeleton tracking makes the skeleton information of human-like objects available for action recognition. The major challenge for action recognition in a video surveillance system is the large variability both across and within subjects. In this paper, we propose a novel deep-learning-based framework that recognizes human actions using skeleton estimation. The main component of the framework is pose estimation using a stacked hourglass network (HGN), which provides the skeleton joint points of humans. Since the position of the skeleton varies with the point of view, we apply transformations to the skeleton points to make them invariant to rotation and position. The skeleton joint positions are identified using HGN-based deep neural networks (HGN-DNN), and feature extraction and classification are then carried out to obtain the action class. The skeleton action sequence is encoded using a Fisher Vector before classification. The proposed system complies with Recommendation ITU-T H.626.5, "Architecture for intelligent visual surveillance systems", and has been evaluated on benchmark human action recognition data sets. The evaluation results show that the system achieves a precision of 85% and an accuracy of 95.6% in recognizing actions such as wave, punch, and kick. The HGN-DNN model meets the requirements and service description specified in Recommendation ITU-T F.743.
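
The rotation- and position-invariance step can be illustrated with a minimal sketch. The abstract does not specify the transformation, so this is not the authors' method. The sketch assumes 2D joints stored as a (J, 2) array. The function name `normalize_skeleton` and the joint indices `root` and `ref` (for example a hip joint and a neck joint) are hypothetical. The skeleton is translated so that the root joint sits at the origin and rotated so that the root-to-reference vector points along the positive y axis.

```python
import numpy as np

def normalize_skeleton(joints, root=0, ref=1):
    """Return joints translated and rotated into a canonical frame.

    joints: array-like of shape (J, 2) holding 2D joint coordinates.
    root, ref: hypothetical joint indices (assumed here, not from the paper).
    """
    joints = np.asarray(joints, dtype=float)

    # Position invariance: move the root joint to the origin.
    centered = joints - joints[root]

    # Rotation invariance: align the root->ref vector with the +y axis.
    v = centered[ref]
    if np.linalg.norm(v) == 0:
        # Degenerate pose: no direction to align with, so skip the rotation.
        return centered

    # Angle of v measured from the +y axis toward the +x axis.
    angle = np.arctan2(v[0], v[1])
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])  # counterclockwise rotation by `angle`

    # Apply the rotation to every joint (rows are points).
    return centered @ rotation.T
```

Under these assumptions, two views of the same pose that differ only by an in-plane rotation and a shift map to the same normalized coordinates, so the features computed afterwards do not depend on where the subject stands or how the camera is rolled. Out-of-plane viewpoint changes would need 3D joints and are not covered by this sketch.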
