3D Human Motion Analysis in Monocular Video Techniques and Challenges

Human motion and action analysis in video is an actively growing field with a broad spectrum of applications including video browsing and indexing, entertainment, virtual reality, human-computer interaction and surveillance. However, human motion analysis systems face important scientific and computational challenges. The proportions of the human body vary across individuals due to gender, weight, age or race. Aside from this variability, any single human body has many degrees of freedom due to articulation and the individual limbs are deformable due to muscle and clothing. Finally many realworld scenes involve multiple interacting humans occluded by each other or by other objects, and the scene conditions may also vary due to the camera motion or lighting changes. These factors make accurate 3D human models difficult to build and difficult to reconstruct reliably from ?flat? 2D images. During this talk I will discuss learning and inference algorithms for estimating 3D human motion in monocular video. While the problem has been traditionally approached using the powerful machinery of generative models, operating in an analysis by synthesis loop, the main emphasis of this talk will be on an emerging class of complementary discriminative temporal estimation models. These can be viewed as upside down, bottom-up versions of classical temporal models used with Kalman filtering or particle filtering. But instead of inverting a generative imaging model, we will learn to cooperatively predict complex, feedforward 2D-to-3D mappings, using Conditional Bayesian Mixtures of Experts. These are embedded in a probabilistic temporal framework in order to enforce dynamic constraints and allow a principled propagation of uncertainty. We call the resulting model BM3E (a Conditional Bayesian Mixture of Experts Markov Model). During the talk, I will discuss how inference can be restricted, for efficiency, to low-dimensional non-linear state spaces, and how the framework can be extended in order to deal with clutter and occlusion. I will also discuss the relative advantages of generative and initialize and recover from failure.

[1]  Michael E. Tipping Sparse Bayesian Learning and the Relevance Vector Machine , 2001, J. Mach. Learn. Res..

[2]  Michael I. Jordan Learning in Graphical Models , 1999, NATO ASI Series.

[3]  Hsi-Jian Lee,et al.  Determination of 3D human body postures from a single view , 1985, Comput. Vis. Graph. Image Process..

[4]  Rómer Rosales,et al.  Learning Body Pose via Specialized Maps , 2001, NIPS.

[5]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[6]  Roberto Cipolla,et al.  Real-time tracking of highly articulated structures in the presence of noisy measurements , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[7]  Ankur Agarwal,et al.  Monocular Human Motion Capture with a Mixture of Regressors , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops.

[8]  James M. Rehg,et al.  A multiple hypothesis approach to figure tracking , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[9]  David J. Fleet,et al.  Priors for people tracking from small training sets , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  David J. Fleet,et al.  People tracking using hybrid Monte Carlo filtering , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[11]  Cristian Sminchisescu,et al.  Learning Joint Top-Down and Bottom-up Processes for 3D Visual Inference , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Cristian Sminchisescu,et al.  Kinematic jump processes for monocular 3D human tracking , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[13]  Bernhard Schölkopf,et al.  Kernel Dependency Estimation , 2002, NIPS.

[14]  Dariu Gavrila,et al.  The Visual Analysis of Human Movement: A Survey , 1999, Comput. Vis. Image Underst..

[15]  Cristian Sminchisescu,et al.  BM³E : Discriminative Density Propagation for Visual Tracking , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Michael Isard,et al.  Tracking loose-limbed people , 2004, CVPR 2004.

[17]  Jitendra Malik,et al.  Tracking people with twists and exponential maps , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[18]  Andrew Blake,et al.  Articulated body motion capture by annealed particle filtering , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[19]  David J. C. MacKay,et al.  The Evidence Framework Applied to Classification Networks , 1992, Neural Computation.

[20]  Cristian Sminchisescu,et al.  Building Roadmaps of Local Minima of Visual Models , 2002, ECCV.

[21]  Bhaskar D. Rao,et al.  Perspectives on Sparse Bayesian Learning , 2003, NIPS.

[22]  Steve R. Waterhouse,et al.  Bayesian Methods for Mixtures of Experts , 1995, NIPS.

[23]  Cristian Sminchisescu,et al.  Density Propagation for Continuous Temporal Chains Generative and Discriminative Models , 2004 .

[24]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[25]  Radford M. Neal Annealed importance sampling , 1998, Stat. Comput..

[26]  Hans-Peter Seidel,et al.  Free-viewpoint video of human actors , 2003, ACM Trans. Graph..

[27]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[28]  Cristian Sminchisescu,et al.  Training Deformable Models for Localization , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[29]  Cristian Sminchisescu,et al.  Discriminative density propagation for 3D human motion estimation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[30]  Michael Isard,et al.  CONDENSATION—Conditional Density Propagation for Visual Tracking , 1998, International Journal of Computer Vision.

[31]  William T. Freeman,et al.  Bayesian Reconstruction of 3D Human Motion from Single-Camera Video , 1999, NIPS.

[32]  A. Elgammal,et al.  Inferring 3D body pose from silhouettes using activity manifold learning , 2004, CVPR 2004.

[33]  Cristian Sminchisescu,et al.  Conditional Random Fields for Contextual Human Motion Recognition , 2005, ICCV.

[34]  Ioannis A. Kakadiaris,et al.  Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection , 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  M. Lee,et al.  Proposal maps driven MCMC for estimating human body pose in static images , 2004, CVPR 2004.

[36]  Cristian Sminchisescu,et al.  Generalized Darting Monte Carlo , 2007, AISTATS.

[37]  Zoran Popovic,et al.  The space of human body shapes: reconstruction and parameterization from range scans , 2003, ACM Trans. Graph..

[38]  Matthew Brand,et al.  Shadow puppetry , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[39]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[40]  Jitendra Malik,et al.  Estimating Human Body Configurations Using Shape Context Matching , 2002, ECCV.

[41]  Carlo Tomasi,et al.  3D tracking = classification + interpolation , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[42]  Michael Isard,et al.  A Smoothing Filter for CONDENSATION , 1998, ECCV.

[43]  Cristian Sminchisescu,et al.  Generative modeling for continuous non-linearly embedded visual inference , 2004, ICML.

[44]  Camillo J. Taylor,et al.  Reconstruction of articulated objects from point correspondences in a single uncalibrated image , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[45]  Michael J. Black,et al.  Gibbs likelihoods for Bayesian tracking , 2004, CVPR 2004.

[46]  Dana H. Ballard,et al.  Computer Vision , 1982 .

[47]  M. Bertero,et al.  Ill-posed problems in early vision , 1988, Proc. IEEE.

[48]  Luc Van Gool,et al.  Full body tracking from multiple views using stochastic sampling , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[49]  Michael Isard,et al.  ICONDENSATION: Unifying Low-Level and High-Level Tracking in a Stochastic Framework , 1998, ECCV.

[50]  Michael J. Black,et al.  Learning image statistics for Bayesian tracking , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[51]  Hans-Hellmut Nagel,et al.  Tracking Persons in Monocular Image Sequences , 1999, Comput. Vis. Image Underst..

[52]  D. Donoho,et al.  Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[53]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[54]  S. Duane,et al.  Hybrid Monte Carlo , 1987 .

[55]  David J. C. MacKay,et al.  Bayesian Interpolation , 1992, Neural Computation.

[56]  Cristian Sminchisescu,et al.  Hyperdynamics Importance Sampling , 2002, ECCV.

[57]  Cristian Sminchisescu Consistency and coupling in human model likelihoods , 2002, Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition.

[58]  N. Gordon,et al.  Novel approach to nonlinear/non-Gaussian Bayesian state estimation , 1993 .

[59]  C. Sminchisescu,et al.  Variational mixture smoothing for non-linear dynamical systems , 2004, CVPR 2004.

[60]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[61]  Cristian Sminchisescu,et al.  Estimating Articulated Human Motion with Covariance Scaled Sampling , 2003, Int. J. Robotics Res..

[62]  Daniel P. Huttenlocher,et al.  Beyond trees: common-factor models for 2D human pose recovery , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.