Online robust action recognition based on a hierarchical model

Action recognition based solely on video data is known to be very sensitive to background activity and lacks the ability to discriminate complex 3D motion. With the development of commercial depth cameras, skeleton-based action recognition is becoming increasingly popular. However, the skeleton-based approach is still very challenging because of the large variation in human actions and temporal dynamics. In this paper, we propose a hierarchical model for action recognition. To handle confusing motions, a motion-based grouping method is proposed that efficiently assigns each video a group label; for each group, a pre-trained classifier is then used for frame labeling. Unlike previous methods, we adopt a bottom-up approach that first performs action recognition for each frame. The final action label for a video is obtained by fusing the classifications of its frames, with the contribution of each frame adaptively adjusted according to its local properties. To achieve online real-time performance and suppress noise, a bag-of-words representation is used for the classification features. The proposed method is evaluated on two challenging datasets captured with a Kinect. Experiments show that our method can robustly recognize actions in real time.
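The hierarchical pipeline sketched in the abstract (group assignment, per-group frame classification, adaptive fusion of frame labels) can be illustrated with the following minimal sketch. It is a hypothetical illustration, not the authors' implementation: `assign_group`, `group_classifiers`, and `frame_weight` are assumed stand-ins, and the classifiers are assumed to expose a scikit-learn-style `predict_proba` interface.

```python
import numpy as np

def recognize_action(frames, assign_group, group_classifiers, frame_weight, n_actions):
    """Hypothetical sketch of the hierarchical recognition pipeline.

    frames: list of per-frame skeleton feature vectors
            (e.g., bag-of-words histograms, as suggested in the abstract).
    """
    # Step 1: motion-based grouping assigns the sequence a group label,
    # separating otherwise confusing motions into coarse groups.
    group = assign_group(frames)
    clf = group_classifiers[group]  # pre-trained classifier for this group

    # Step 2: bottom-up, per-frame classification.
    votes = np.zeros(n_actions)
    for f in frames:
        probs = clf.predict_proba(f.reshape(1, -1))[0]  # per-frame class scores

        # Step 3: weight each frame's contribution by its local properties
        # (frame_weight is an assumed placeholder for that adaptive weighting).
        votes += frame_weight(f) * probs

    # Step 4: fuse the frame-level decisions into a single video-level label.
    return int(np.argmax(votes))
```

The sketch only mirrors the data flow described in the abstract; the actual grouping features, classifiers, and weighting scheme are specified in the paper itself.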
