Predicting Dynamical Evolution of Human Activities from a Single Image

A human pose often conveys not only the configuration of the body parts, but also implicit predictive information about the ensuing motion. This dynamic information can benefit vision applications that lack explicit motion cues. The human visual system can easily perceive the dynamic information in still images. However, computational algorithms that infer and use it in computer vision applications are limited. In this paper, we propose a probabilistic framework to infer the dynamic information associated with a human pose. The inference problem is posed as a nonparametric density estimation problem on a non-Euclidean manifold of linear dynamical models. Since direct modeling is intractable, we develop a data-driven approach that estimates the density for the test sample under consideration. Statistical inference on the estimated density provides quantities of interest such as the most probable future motion of the human and the amount of motion information conveyed by a pose. Our experiments demonstrate that the extracted motion information benefits several applications in computer vision. In particular, the predicted future motion is useful for activity recognition, human trajectory synthesis, and motion prediction.
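
To make the two inferred quantities concrete, the following is a minimal sketch of nonparametric density estimation followed by mode selection and an entropy estimate. It is not the method of the paper: it assumes plain Euclidean vectors and a Gaussian kernel as a stand-in for the manifold of linear dynamical models, and the function names, the bandwidth value, and the sample data are all hypothetical.

```python
import math

def gaussian_kde(query, samples, bandwidth):
    """Gaussian kernel density estimate at `query` from d-dimensional samples.

    Assumption: Euclidean distance stands in for a manifold distance.
    """
    d = len(query)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((q - x) ** 2 for q, x in zip(query, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(samples) * norm)

def most_probable_sample(samples, bandwidth):
    """Return the sample with the highest estimated density (a crude mode)."""
    return max(samples, key=lambda s: gaussian_kde(s, samples, bandwidth))

def resubstitution_entropy(samples, bandwidth):
    """Estimate entropy as the negative mean log-density at the samples."""
    logs = [math.log(gaussian_kde(s, samples, bandwidth)) for s in samples]
    return -sum(logs) / len(logs)

# Hypothetical descriptors of candidate future motions for one test pose.
concentrated = [(0.0, 0.0), (0.2, 0.0), (-0.2, 0.0), (0.0, 0.2), (0.0, -0.2)]
spread = [(0.0, 0.0), (2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
with_outlier = concentrated + [(3.0, 3.0)]

mode = most_probable_sample(with_outlier, bandwidth=0.5)
h_concentrated = resubstitution_entropy(concentrated, bandwidth=0.5)
h_spread = resubstitution_entropy(spread, bandwidth=0.5)
```

In this toy setting, the selected mode plays the role of the most probable future motion, and a lower entropy corresponds to a pose whose candidate motions agree closely, that is, a pose that conveys more motion information.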
