HumanMAC: Masked Motion Completion for Human Motion Prediction

Human motion prediction is a classical problem in computer vision and computer graphics with a wide range of practical applications. Previous efforts achieve strong empirical performance with an encoding-decoding style: they first encode previous motions into latent representations and then decode those representations into predicted motions. In practice, however, these methods remain unsatisfactory because of several issues, including complicated loss constraints, cumbersome training processes, and limited support for switching between different categories of motions during prediction. To address these issues, we depart from the encoding-decoding style and propose a novel framework from a new perspective. Specifically, our framework works in a masked completion fashion. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, a denoising procedure conditions the prediction on the observed motions, yielding more continuous and controllable predictions. The proposed framework has appealing algorithmic properties: it needs only one loss for optimization and is trained in an end-to-end manner. Additionally, it switches between different categories of motions effectively, which matters in realistic tasks such as animation. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.
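To make the masked completion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a standard DDPM-style reverse process with a linear noise schedule and fills in the unobserved frames while repeatedly overwriting the observed frames with a suitably noised copy of the observation. The names `make_schedule`, `toy_denoiser`, and `masked_completion` are hypothetical, and `toy_denoiser` is a placeholder for a trained noise-prediction network.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; other schedules are possible)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_bar = np.cumprod(1.0 - betas)
    return betas, alphas_bar


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return np.zeros_like(x_t)


def masked_completion(observed, mask, denoiser, num_steps=50, seed=0):
    """Complete a motion sequence given observed frames.

    observed: array of shape (frames, dims); unobserved entries may hold any value.
    mask:     array broadcastable to observed; 1 = observed, 0 = to be predicted.
    """
    rng = np.random.default_rng(seed)
    betas, alphas_bar = make_schedule(num_steps)
    x = rng.standard_normal(observed.shape)  # start from pure noise

    for t in range(num_steps - 1, -1, -1):
        # One DDPM reverse step for the whole sequence.
        eps = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(1.0 - betas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x_unknown = mean + np.sqrt(betas[t]) * noise

        # Bring the observation to the noise level of the next step.
        if t > 0:
            ab_prev = alphas_bar[t - 1]
            x_known = (np.sqrt(ab_prev) * observed
                       + np.sqrt(1.0 - ab_prev) * rng.standard_normal(x.shape))
        else:
            x_known = observed

        # Masked completion: keep observed frames, keep predictions elsewhere.
        x = mask * x_known + (1.0 - mask) * x_unknown
    return x


if __name__ == "__main__":
    frames, dims = 10, 3
    observed = np.zeros((frames, dims))
    observed[:4] = 1.0                 # first 4 frames are observed
    mask = np.zeros((frames, 1))
    mask[:4] = 1.0
    completed = masked_completion(observed, mask, toy_denoiser)
    print(completed.shape)
```

With a trained denoiser in place of `toy_denoiser`, the same loop would draw different futures for different seeds, while the observed frames are reproduced exactly at the final step.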
