Skeleton Based Temporal Action Detection with YOLO

Detecting actions in untrimmed sequences is an important yet challenging task. In this paper, we transform the temporal action detection problem into an object detection problem. Our method supports real-time detection and end-to-end training, and consists of two stages. First, we propose a representation of skeleton action sequences as images that preserves the original temporal dynamics and spatial structure. Second, based on this representation, we design a one-dimensional YOLO network to detect human actions along the temporal axis. In addition, we build a dataset for skeleton-based temporal action detection. Experiments on this dataset demonstrate the superiority of our method.
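The first stage, mapping a skeleton sequence to an image, can be sketched as follows. This is a minimal illustration of one common encoding, not necessarily the paper's exact scheme: each frame contains J joints with (x, y, z) coordinates, and stacking frames along the width axis yields a J × T × 3 array, so rows preserve the spatial joint layout, columns preserve temporal order, and the three coordinates play the role of color channels.

```python
import numpy as np

def skeleton_to_image(frames):
    """Encode a skeleton sequence as an image-like array.

    Assumed encoding (illustrative, not the paper's exact method):
    `frames` has shape (T, J, 3) -- T frames, J joints, 3 coordinates.
    The output has shape (J, T, 3): joints as rows, time as columns,
    coordinates as channels, normalized to [0, 255] like pixels.
    """
    seq = np.asarray(frames, dtype=np.float32)   # (T, J, 3)
    img = seq.transpose(1, 0, 2)                 # (J, T, 3)
    # Normalize each coordinate channel independently to [0, 255].
    mn = img.min(axis=(0, 1), keepdims=True)
    mx = img.max(axis=(0, 1), keepdims=True)
    img = 255.0 * (img - mn) / np.maximum(mx - mn, 1e-6)
    return img.astype(np.uint8)

# Example: 100 frames of 25 Kinect joints (Kinect v2 joint count).
sequence = np.random.rand(100, 25, 3)
image = skeleton_to_image(sequence)
```

A one-dimensional YOLO detector could then slide over the temporal (width) axis of this image and regress, for each temporal cell, an action class together with a segment center and length, in analogy to YOLO's box center and size.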
