Coupled Action Recognition and Pose Estimation from Multiple Views

Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present here a framework for coupled action recognition and pose estimation by formulating pose estimation as an optimization over a set of action-specific manifolds. The framework allows for integration of a 2D appearance-based action recognition system as a prior for 3D pose estimation and for refinement of the action labels using relational pose features based on the extracted 3D poses. Our experiments show that our pose estimation system is able to estimate body poses with high degrees of freedom using very few particles and can achieve state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate the impact of pose estimation and action recognition accuracy on each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance in comparison to action recognition using low-level appearance features.

[1]  Michael Gleicher,et al.  Automated extraction and parameterization of motions in large data sets , 2004, SIGGRAPH 2004.

[2]  Hans-Peter Seidel,et al.  Optimization and Filtering for Human Motion Capture , 2010, International Journal of Computer Vision.

[3]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[4]  Neil A. Thacker,et al.  Real-time Body Tracking Using a Gaussian Process Latent Variable Model , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[5]  Rui Li,et al.  Simultaneous Learning of Nonlinear Manifold and Dynamical Models for High-dimensional Time Series , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[6]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Cristian Sminchisescu,et al.  BM³E : Discriminative Density Propagation for Visual Tracking , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  P. Moral Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications , 2004 .

[9]  Trevor Darrell,et al.  Conditional Random People: Tracking Humans with CRFs and Grid Filters , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[11]  Ramakant Nevatia,et al.  Learning 3D action models from a few 2D videos for view invariant action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Brian Jefferies Feynman-Kac Formulae , 1996 .

[13]  Neil D. Lawrence,et al.  Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models , 2005, J. Mach. Learn. Res..

[14]  David C. Hogg,et al.  An efficient method for contour tracking using active shape models , 1994, Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects.

[15]  Edmond Boyer,et al.  Action recognition using exemplar-based embedding , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[17]  Aaron F. Bobick,et al.  Recognition of human body motion using phase space constraints , 1995, Proceedings of IEEE International Conference on Computer Vision.

[18]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  James F. O'Brien,et al.  Computational Studies of Human Motion , 2006 .

[20]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[22]  Hans-Peter Seidel,et al.  Scaled Motion Dynamics for Markerless Motion Capture , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Mubarak Shah,et al.  View-Invariant Representation and Recognition of Actions , 2002, International Journal of Computer Vision.

[24]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  S. Mitra,et al.  Gesture Recognition: A Survey , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[27]  Ehud Rivlin,et al.  Dimensionality reduction using a Gaussian Process Annealed Particle Filter for tracking and classification of articulated body motions , 2011, Comput. Vis. Image Underst..

[28]  Luc Van Gool,et al.  Exemplar-based Action Recognition in Video , 2009, BMVC.

[29]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[30]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[31]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Narendra Ahuja,et al.  Learning Nonlinear Manifolds from Time Series , 2006, ECCV.

[33]  Hans-Peter Seidel,et al.  Motion capture using joint skeleton tracking and surface estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[35]  Rui Li,et al.  3D Human Motion Tracking with a Coordinated Mixture of Factor Analyzers , 2009, International Journal of Computer Vision.

[36]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[37]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[38]  Trevor Darrell,et al.  Rank priors for continuous non-linear dimensionality reduction , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[41]  Juergen Gall,et al.  International Journal of Computer Vision manuscript No. (will be inserted by the editor) Optimization and Filtering for Human Motion Capture A Multi-layer Framework , 2022 .

[42]  A. Elgammal,et al.  Inferring 3D body pose from silhouettes using activity manifold learning , 2004, CVPR 2004.

[43]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[44]  Masatsugu Kidode,et al.  Complex volume and pose tracking with probabilistic dynamical models and visual hull constraints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[45]  Tido Röder,et al.  Efficient content-based retrieval of motion capture data , 2005, SIGGRAPH 2005.

[46]  Ahmed M. Elgammal,et al.  Coupled Visual and Kinematic Manifold Models for Tracking , 2010, International Journal of Computer Vision.

[47]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[48]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[49]  David J. Fleet,et al.  Priors for people tracking from small training sets , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[50]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51]  Svetha Venkatesh,et al.  A Study on Smoothing for Particle-Filtered 3D Human Body Tracking , 2010, International Journal of Computer Vision.

[52]  Luc Van Gool,et al.  Hough Forests for Object Detection, Tracking, and Action Recognition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Bodo Rosenhahn,et al.  Region-based pose tracking with occlusions using 3D models , 2010, Machine Vision and Applications.

[54]  Larry S. Davis,et al.  Towards 3-D model-based tracking and recognition of human movement: a multi-view approach , 1995 .

[55]  Cordelia Schmid,et al.  Human Focused Action Localization in Video , 2010, ECCV Workshops.

[56]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[57]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[58]  Hans-Peter Seidel,et al.  Stabilizing motion tracking using retrieved motion priors , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[59]  Yang Wang,et al.  Recognizing human actions from still images with latent poses , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[60]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[61]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Michael J. Black,et al.  Implicit Probabilistic Models of Human Motion for Synthesis and Tracking , 2002, ECCV.

[63]  B. Rosenhahn,et al.  Region Based 3-D Pose Tracking with Occlusions , 2010 .

[64]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[65]  Luc Van Gool,et al.  Does Human Action Recognition Benefit from Pose Estimation? , 2011, BMVC.

[66]  John Darby,et al.  Tracking human pose with multiple activity models , 2010, Pattern Recognit..

[67]  David J. Fleet,et al.  3D People Tracking with Gaussian Process Dynamical Models , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[68]  Jixu Chen,et al.  Switching Gaussian Process Dynamic Models for simultaneous composite motion tracking and recognition , 2009, CVPR.

[69]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[70]  Moritz Tenorth,et al.  The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[71]  Luc Van Gool,et al.  Learning Generative Models for Multi-Activity Body Pose Estimation , 2008, International Journal of Computer Vision.

[72]  Odest Chadwicke Jenkins,et al.  Interactive Human Pose and Action Recognition Using Dynamical Motion Primitives , 2007, Int. J. Humanoid Robotics.

[73]  David J. Fleet,et al.  Physics-Based Person Tracking Using the Anthropomorphic Walker , 2010, International Journal of Computer Vision.

[74]  Andrew M. Wallace,et al.  Behavioural Analysis with Movement Cluster Model for Concurrent Actions , 2011, EURASIP J. Image Video Process..

[75]  Hans-Peter Seidel,et al.  A comparison of 3d model-based tracking approaches for human motion capture in uncontrolled environments , 2009, 2009 Workshop on Applications of Computer Vision (WACV).

[76]  Cristian Sminchisescu,et al.  Generative modeling for continuous non-linearly embedded visual inference , 2004, ICML.

[77]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[78]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[79]  Rómer Rosales,et al.  Learning Body Pose via Specialized Maps , 2001, NIPS.

[80]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[81]  Ian D. Reid,et al.  Articulated Body Motion Capture by Stochastic Search , 2005, International Journal of Computer Vision.

[82]  Luc Van Gool,et al.  2D Action Recognition Serves 3D Human Pose Estimation , 2010, ECCV.

[83]  Emiliano Gambaretto,et al.  Markerless Motion Capture through Visual Hull, Articulated ICP and Subject Specific Model Generation , 2010, International Journal of Computer Vision.

[84]  David J. Fleet,et al.  Dynamical binary latent variable models for 3D human pose tracking , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[85]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[86]  Mubarak Shah,et al.  Recognizing human actions in videos acquired by uncalibrated moving cameras , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[87]  Christoph Schnörr,et al.  A Study of Parts-Based Object Class Detection Using Complete Graphs , 2010, International Journal of Computer Vision.

[88]  Hans-Peter Seidel,et al.  Drift-free tracking of rigid and articulated objects , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[89]  Hans-Peter Seidel,et al.  An Introduction to Interacting Simulated Annealing , 2006, Human Motion.

[90]  Mubarak Shah,et al.  Chaotic Invariants for Human Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[91]  Vladimir Pavlovic,et al.  Impact of Dynamics on Subspace Embedding and Tracking of Sequences , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[92]  Michael J. Black,et al.  Parameterized Modeling and Recognition of Activities , 1999, Comput. Vis. Image Underst..

[93]  Peng Zhang,et al.  Nonlinear Dimensionality Reduction by Locally Linear Inlaying , 2009, IEEE Transactions on Neural Networks.

[94]  David A. Forsyth,et al.  Computational Studies of Human Motion: Part 1, Tracking and Motion Synthesis , 2005, Found. Trends Comput. Graph. Vis..

[95]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.