Dynamic Manifold Warping for View-Invariant Action Recognition

We address the problem of learning view-invariant 3D models of human motion from motion capture data, in order to recognize human actions in monocular video sequences captured from arbitrary viewpoints. We propose a Spatio-Temporal Manifold (STM) model to analyze non-linear multivariate time series with latent spatial structure, and apply it to recognize actions in the joint-trajectory space. Building on STM, we propose a novel alignment algorithm, Dynamic Manifold Warping (DMW), and a robust motion similarity metric for human action sequences in both 2D and 3D. DMW extends previous work on spatio-temporal alignment by incorporating manifold learning. We evaluate the approach against state-of-the-art methods on motion capture data and realistic videos. Experimental results demonstrate the effectiveness of our approach, which yields visually appealing alignment results, produces higher action recognition accuracy, and can recognize actions from arbitrary views with partial occlusion.
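For readers unfamiliar with temporal alignment, the sketch below shows classic dynamic time warping (DTW), the standard baseline for aligning two sequences that evolve at different speeds. It is not the DMW algorithm described above: it has no manifold learning and no spatial (view) alignment. The function name `dtw` and the toy sequences are illustrative assumptions, not part of the original work.

```python
import numpy as np

def dtw(x, y):
    """Classic DTW between sequences x (n, d) and y (m, d).

    Illustrative baseline only; NOT the paper's DMW algorithm.
    Returns the accumulated alignment cost and the warping path
    as a list of (index_in_x, index_in_y) pairs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n frames of dimension 1
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)

    # Pairwise Euclidean distance between every frame of x and of y.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # acc[i, j] = minimal cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both sequences
                acc[i - 1, j],      # advance x only
                acc[i, j - 1],      # advance y only
            )

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path

# Toy example: y repeats the first frame of x, so the two align at zero cost.
dist, path = dtw([0, 1, 2, 3], [0, 0, 1, 2, 3])
```

In this toy case the first frame of `x` is matched to the two repeated frames of `y`, and the remaining frames are matched one to one. Applied to real joint trajectories, each row of `x` and `y` would hold one frame's joint coordinates.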
