Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition

Given a video or a time series of skeleton data, action recognition systems perform classification using cues such as motion, appearance, and pose. For the past decade, actions have typically been modeled with low-level feature representations such as Bag of Features. More recent work has shown that mid-level representations modeling body-part movements (e.g., a hand moving forward) can be very effective. However, these mid-level features are usually hand-crafted, and the dictionary of representative features is learned using ad hoc heuristics. While automatic feature learning methods such as supervised sparse dictionary learning or neural networks can learn feature representations and action classifiers jointly, the resulting features are usually not interpretable. In contrast, our goal is to develop a principled feature learning framework that learns discriminative and interpretable skeletal motion patterns for action recognition. To this end, we propose a novel body-part-motion-based feature called the Moving Poselet, which corresponds to a specific body part configuration undergoing a specific movement. We also propose a simple algorithm for jointly learning Moving Poselets and action classifiers. Experiments on the MSR Action3D, MSR DailyActivity3D, and Berkeley MHAD datasets show that our two-layer model outperforms other two-layer models that use hand-crafted features, and achieves results comparable to those of recent multi-layer Hierarchical Recurrent Neural Network (HRNN) models, which use multiple layers of RNNs to model the human body hierarchy.
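To make the idea concrete, a Moving Poselet can be read as a linear filter applied to a short temporal window of one body part's joint trajectory, with the filter response max-pooled over all window placements, and the action score obtained as a linear combination of the pooled activations. The sketch below illustrates only this reading; the function names, the window length, and the exact pooling scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def poselet_response(part_traj, w, b, win=5):
    """Max-pooled response of one (hypothetical) Moving Poselet filter.

    part_traj: (T, D) array of a body part's joint coordinates over T frames.
    w, b: linear filter over a vectorized window of `win` frames (length win*D).
    Returns the maximum filter response over all temporal placements.
    """
    T = part_traj.shape[0]
    scores = [w @ part_traj[t:t + win].ravel() + b
              for t in range(T - win + 1)]
    return max(scores)

def action_scores(part_trajs, filters, W):
    """Score each action class from max-pooled Moving Poselet activations.

    part_trajs: list of (T, D) trajectories, one per body part.
    filters: list of (part_index, w, b) Moving Poselet filters.
    W: (num_classes, num_filters) top-level classifier weights.
    """
    acts = np.array([poselet_response(part_trajs[p], w, b)
                     for (p, w, b) in filters])
    return W @ acts
```

In this reading, "jointly learning Moving Poselets and action classifiers" means optimizing the filters `(w, b)` and the top-level weights `W` together against the classification loss, rather than fixing a hand-crafted dictionary first.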
