SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation

Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial composition requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, we show that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
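The abstract describes querying a language model for the body parts involved in each action and then compositing motions accordingly. Below is a minimal sketch of that querying step, assuming the legacy `openai` Python client (pre-1.0); the prompt wording, body-part list, and few-shot examples are illustrative stand-ins, not the exact ones used by SINC.

```python
# Sketch: ask GPT-3 which body parts an action involves, so that two motions
# can later be composited part-by-part. Assumes OPENAI_API_KEY is set and the
# legacy `openai` client (Completion API) is installed.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical coarse part list; SINC's actual list may differ.
BODY_PARTS = ["left arm", "right arm", "left leg", "right leg",
              "torso", "head", "global orientation"]

# Hypothetical few-shot examples included in the prompt.
FEW_SHOT = (
    "Action: wave the right hand\n"
    "Body parts: right arm\n\n"
    "Action: walk forward\n"
    "Body parts: left leg, right leg, global orientation\n\n"
)

def body_parts_for(action: str) -> list[str]:
    """Query GPT-3 for the body parts involved in `action`."""
    prompt = (
        "What are the body parts involved in the action? "
        f"Answer using only this list: {', '.join(BODY_PARTS)}.\n\n"
        f"{FEW_SHOT}Action: {action}\nBody parts:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.0,
        max_tokens=32,
    )
    answer = response["choices"][0]["text"].lower()
    # Keep only parts that appear verbatim in the allowed list.
    return [p for p in BODY_PARTS if p in answer]

if __name__ == "__main__":
    # Given the two part sets, a spatial composition could then take, e.g.,
    # the arm joints from one motion and the remaining joints from the other.
    print(body_parts_for("wave the left hand"))
    print(body_parts_for("walk in a circle"))
```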
