Action2Motion: Conditioned Generation of 3D Human Motions

Action recognition is a relatively established task: given an input sequence of human motion, the goal is to predict its action category. This paper considers a relatively new problem that can be thought of as the inverse of action recognition: given a prescribed action type, we aim to generate plausible 3D human motion sequences. Importantly, the generated motions should be diverse enough to explore the entire action-conditioned motion space, while each sampled sequence should faithfully resemble the dynamics of natural human body articulation. Motivated by these objectives, we follow the kinematic laws of the human body by adopting Lie algebra theory to represent natural human motions, and we propose a temporal Variational Auto-Encoder (VAE) that encourages diverse sampling of the motion space. We also construct a new 3D human motion dataset, HumanAct12. Empirical experiments on three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.
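To make the setup concrete, below is a minimal sketch of one step of an action-conditioned temporal VAE in the spirit of the model described above. This is not the authors' implementation; all names (MotionVAEStep, pose_dim, and so on) and the specific wiring are illustrative assumptions. At each time step the network conditions on the previous pose and a one-hot action label, samples a latent code via the reparameterization trick (from a posterior during training, from a learned prior during generation), and decodes the next pose. In practice, the per-frame pose vector would hold the Lie-algebra (axis-angle) parameters of each joint rather than raw coordinates.

```python
# Hypothetical sketch of one step of an action-conditioned temporal VAE.
# Names and dimensions are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class MotionVAEStep(nn.Module):
    """Generates the pose at time t from the previous pose, the action
    label, and a per-step latent sample (reparameterization trick)."""

    def __init__(self, pose_dim=72, num_actions=12, latent_dim=32, hidden_dim=128):
        super().__init__()
        cond_dim = pose_dim + num_actions                     # previous pose + one-hot action
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)    # learned prior p(z_t | h_{t-1})
        self.posterior = nn.Linear(hidden_dim + pose_dim, 2 * latent_dim)
        self.gru = nn.GRUCell(cond_dim + latent_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, pose_dim)

    def forward(self, prev_pose, action_onehot, h, target_pose=None):
        cond = torch.cat([prev_pose, action_onehot], dim=-1)
        if target_pose is not None:          # training: posterior q(z_t | x_t, h_{t-1})
            stats = self.posterior(torch.cat([h, target_pose], dim=-1))
        else:                                # generation: sample from the learned prior
            stats = self.prior(h)
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        h = self.gru(torch.cat([cond, z], dim=-1), h)
        return self.decoder(h), h, mu, logvar

# Usage: roll out 60 frames of one action class for a batch of 4 sequences.
model = MotionVAEStep()
pose = torch.zeros(4, 72)
h = torch.zeros(4, 128)
action = torch.eye(12)[torch.tensor([3, 3, 3, 3])]            # one-hot action label
frames = []
for _ in range(60):
    pose, h, _, _ = model(pose, action, h)
    frames.append(pose)
```

Training such a model would combine a pose reconstruction loss with a KL term between the posterior and the learned prior at each step; sampling different latent codes under the same action label is what yields the diversity the paper emphasizes.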
