Metric Learning from Poses for Temporal Clustering of Human Motion

Temporal clustering of human motion into semantically meaningful behaviors is a challenging task. While unsupervised methods do well to some extent, the obtained clusters often lack a semantic interpretation. In this paper, we propose to learn what makes a sequence of human poses different from others such that it should be annotated as an action. To this end, we formulate the problem as weakly supervised temporal clustering for an unknown number of clusters. Weak supervision is attained by learning a metric from the implicit semantic distances derived from already annotated databases. Such a metric contains some low-level semantic information that can be used to effectively segment a human motion sequence into distinct actions or behaviors. The main advantage of our approach is that metrics can be successfully used across datasets, making our method a compelling alternative to unsupervised methods. Experiments on publicly available mocap datasets show the effectiveness of our approach.

[1]  Luc Van Gool,et al.  Does Human Action Recognition Benefit from Pose Estimation? , 2011, BMVC.

[2]  Takahiro Okabe,et al.  Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.

[3]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[4]  Stefano Soatto,et al.  Spike train driven dynamical models for human actions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Mario Sznaier,et al.  Sequential sparsification for change detection , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Inderjit S. Dhillon,et al.  Information-theoretic metric learning , 2006, ICML '07.

[7]  Bobby Bodenheimer,et al.  An evaluation of a cost metric for selecting transitions between motion segments , 2003, SCA '03.

[8]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[9]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[10]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[11]  Yueting Zhuang,et al.  Perceptual 3D pose distance estimation by boosting relational geometric features , 2009, Comput. Animat. Virtual Worlds.

[12]  Luc Van Gool,et al.  What's going on? Discovering spatio-temporal dependencies in dynamic scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Du Tran,et al.  Human Activity Recognition with Metric Learning , 2008, ECCV.

[14]  Tal Hassner,et al.  One Shot Similarity Metric Learning for Action Recognition , 2011, SIMBAD.

[15]  Jernej Barbic,et al.  Segmenting Motion Capture Data into Distinct Behaviors , 2004, Graphics Interface.

[16]  Jessica K. Hodgins,et al.  Aligned Cluster Analysis for temporal segmentation of human motion , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[17]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[18]  Jessica K. Hodgins,et al.  Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Michael I. Jordan,et al.  Nonparametric Bayesian Learning of Switching Linear Dynamical Systems , 2008, NIPS.

[20]  Taku Komura,et al.  Emulating human perception of motion similarity , 2008, Comput. Animat. Virtual Worlds.

[21]  Yi Yang,et al.  Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor , 2011, IEEE Transactions on Visualization and Computer Graphics.