AMD: Autoregressive Motion Diffusion

Human motion generation aims to produce plausible human motion sequences according to various conditional inputs, such as text or audio. While existing methods can generate motion from short prompts with simple motion patterns, they struggle with long prompts and complex motions. The challenges are two-fold: 1) the scarcity of motion-capture data paired with long prompts and complex motions, and 2) the high temporal diversity of human motion and the substantial divergence between the distributions of the conditioning modalities and the motion itself, which together yield a many-to-many mapping problem when generating motion from long, complex texts. In this work, we address these gaps by 1) curating HumanLong3D, the first dataset pairing long textual descriptions with complex 3D motions, and 2) proposing an Autoregressive Motion Diffusion model (AMD). Specifically, AMD iteratively predicts the current motion sequence by conditioning on the current text prompt together with the text prompt and motion sequence from the previous step. Furthermore, we present its generalization to X-to-Motion under the principle of "No Modality Left Behind", enabling, for the first time, the generation of high-definition, high-fidelity human motion from user-specified modality inputs.
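To make the autoregressive conditioning concrete, below is a minimal, runnable PyTorch sketch of the iterative scheme described above. It is illustrative only, not the paper's architecture: the `SegmentDenoiser` module, the feature sizes (263-dimensional HumanML3D-style motion features, 512-dimensional CLIP-style text embeddings), and the simplified reverse-diffusion loop are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hypothetical denoiser: predicts the clean motion segment from a noisy one,
# conditioned on the current text embedding plus the previous segment's text
# embedding and motion (the autoregressive context described in the abstract).
# Timestep embedding and attention layers are omitted for brevity.
class SegmentDenoiser(nn.Module):
    def __init__(self, motion_dim=263, text_dim=512, hidden=512):
        super().__init__()
        self.proj = nn.Linear(motion_dim + 2 * text_dim + motion_dim, hidden)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, text_cur, text_prev, motion_prev):
        T = x_t.shape[1]
        # Pool the previous motion segment and broadcast all per-sequence
        # conditions across the time axis of the noisy segment.
        cond = torch.cat([text_cur, text_prev, motion_prev.mean(dim=1)], dim=-1)
        h = torch.cat([x_t, cond.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        return self.out(torch.relu(self.proj(h)))

@torch.no_grad()
def sample_long_motion(denoiser, text_embs, seg_len=60, motion_dim=263, steps=50):
    """Generate one motion segment per text prompt; each segment is denoised
    conditioned on the previous prompt and the previously generated segment,
    then the context is rolled forward (autoregressive generation)."""
    B = text_embs[0].shape[0]
    text_prev = torch.zeros_like(text_embs[0])          # empty context at t=0
    motion_prev = torch.zeros(B, seg_len, motion_dim)
    segments = []
    for text_cur in text_embs:
        x = torch.randn(B, seg_len, motion_dim)          # start from pure noise
        for t in reversed(range(steps)):                 # crude DDPM-style loop
            x0_hat = denoiser(x, t, text_cur, text_prev, motion_prev)
            alpha = t / steps
            x = alpha * x + (1 - alpha) * x0_hat         # step toward x0 estimate
            if t > 0:
                x = x + 0.01 * torch.randn_like(x)       # small noise injection
        segments.append(x)
        text_prev, motion_prev = text_cur, x             # roll context forward
    return torch.cat(segments, dim=1)                    # stitch along time

denoiser = SegmentDenoiser()
prompts = [torch.randn(2, 512) for _ in range(3)]        # stand-in text embeddings
motion = sample_long_motion(denoiser, prompts)
print(motion.shape)                                      # torch.Size([2, 180, 263])
```

The key design point the sketch captures is that, unlike single-shot text-to-motion diffusion, each segment's reverse process is conditioned on the previous segment's text and motion, so long sequences are built up iteratively rather than generated in one pass.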
