Bidirectional Transformer GAN for Long-Term Human Motion Prediction

Mainstream motion prediction methods usually focus on short-term prediction, and their long-term predictions often collapse to an average pose, i.e., the freezing forecasting problem [27]. To mitigate this problem, we propose a novel Bidirectional Transformer-based Generative Adversarial Network (BiTGAN) for long-term human motion prediction. The bidirectional setup leads to consistent and smooth generation in both the forward and backward directions. In addition, to make full use of the historical motions, we split them into two parts: the first part is fed to the Transformer encoder in our BiTGAN, while the second part is used as the decoder input. This strategy can alleviate the exposure bias problem [37]. Furthermore, to better maintain both the local (i.e., frame-level pose) and global (i.e., video-level semantic) similarities between the predicted motion sequence and the real one, a soft dynamic time warping (Soft-DTW) loss is introduced into the generator. Finally, we utilize a dual discriminator to distinguish the predicted sequence at both the frame and sequence levels. Extensive experiments on the public Human3.6M dataset demonstrate that our proposed BiTGAN achieves state-of-the-art performance on long-term (4 s) human motion prediction and reduces the average error over all actions by 4%.
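The Soft-DTW loss [38] mentioned above can be illustrated with a minimal NumPy sketch of its forward recursion. This is not the BiTGAN implementation: the function names `soft_min` and `soft_dtw`, the squared Euclidean frame distance, and the smoothing value `gamma` are assumptions made for illustration only, and a trainable version would need an automatic-differentiation framework.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    m = values.min()
    if not np.isfinite(m):
        return m  # all candidates are +inf (unreachable cell)
    return m - gamma * np.log(np.sum(np.exp(-(values - m) / gamma)))


def soft_dtw(pred, target, gamma=1.0):
    """Soft-DTW discrepancy between two pose sequences.

    pred:   array of shape (n_frames_pred, pose_dim)
    target: array of shape (n_frames_target, pose_dim)
    gamma:  smoothing strength (assumed value); as gamma -> 0 the result
            approaches the classic hard DTW cost.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)

    # Frame-to-frame cost: squared Euclidean distance (an assumption here).
    diff = pred[:, None, :] - target[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)

    # Accumulated cost table with an extra border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            neighbours = [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + soft_min(neighbours, gamma)
    return acc[n, m]


if __name__ == "__main__":
    real = np.array([[0.0], [1.0], [2.0]])
    delayed = np.array([[0.0], [0.0], [1.0], [2.0]])  # same motion, one frame late
    print(soft_dtw(real, delayed, gamma=1e-3))
```

Because the alignment is elastic, a prediction that performs the same motion slightly early or late is penalized far less than it would be by a strict frame-by-frame error, which is one plausible reading of why such a loss helps preserve sequence-level similarity.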

[1] Jingkuan Song, et al. Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning, 2022, ACM Trans. Multim. Comput. Commun. Appl.

[2] Fahad Shahbaz Khan, et al. Transformers in Vision: A Survey, 2021, ACM Comput. Surv.

[3] Francesco G. B. De Natale, et al. Where Are They Going? Predicting Human Behaviors in Crowded Scenes, 2021, ACM Trans. Multim. Comput. Commun. Appl.

[4] Yifang Yin, et al. Motion Prediction via Joint Dependency Modeling in Phase Space, 2021, ACM Multimedia.

[5] Zhenguang Liu, et al. Learning Human Motion Prediction via Stochastic Differential Equations, 2021, ACM Multimedia.

[6] Fabio Galasso, et al. Space-Time-Separable Graph Convolutional Network for Pose Forecasting, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Kaifeng Gao, et al. Video Relation Detection via Tracklet based Visual Transformer, 2021, ACM Multimedia.

[8] Yongwei Nie, et al. MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Masayoshi Tomizuka, et al. RAIN: Reinforced Hybrid Attention Inference Network for Motion Forecasting, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Bolei Zhou, et al. Multimodal Motion Prediction with Stacked Transformers, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Zhengxia Zou, et al. Single-Shot Motion Completion with Transformer, 2021, ArXiv.

[12] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[13] Andrea Esuli, et al. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders, 2020, ACM Trans. Multim. Comput. Commun. Appl.

[14] David A. Ross, et al. Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, 2021, ArXiv.

[15] Qifeng Chen, et al. Self-supervised Dance Video Synthesis Conditioned on Music, 2020, ACM Multimedia.

[16] Sanja Fidler, et al. Learning to Generate Diverse Dance Motions with Transformer, 2020, ArXiv.

[17] Mathieu Salzmann, et al. History Repeats Itself: Human Motion Prediction via Motion Attention, 2020, ECCV.

[18] Nicu Sebe, et al. XingGAN for Person Image Generation, 2020, ECCV.

[19] Minh Vo, et al. Long-term Human Motion Prediction with Scene Context, 2020, ECCV.

[20] Yanfeng Wang, et al. Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Nadia Magnenat-Thalmann, et al. Learning Progressive Joint Propagation for Human Motion Prediction, 2020, ECCV.

[22] Juan Carlos Niebles, et al. Imitation Learning for Human Pose Prediction, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23] Hongdong Li, et al. Learning Trajectory Dependencies for Human Motion Prediction, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[25] Roger Zimmermann, et al. Towards Natural and Accurate Future Motion Prediction of Humans and Animals, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[27] R. Venkatesh Babu, et al. BiHMP-GAN: Bidirectional 3D Human Motion Prediction GAN, 2018, AAAI.

[28] C. Lee Giles, et al. A Neural Temporal Model for Human Motion Prediction, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[30] Nicu Sebe, et al. Dual Generator Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, 2018, ACCV.

[31] José M. F. Moura, et al. Adversarial Geometry-Aware Human Motion Prediction, 2018, ECCV.

[32] Bodo Rosenhahn, et al. Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera, 2018.

[33] Wei Liu, et al. Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamic, 2018, IJCAI.

[34] Zhen Zhang, et al. Convolutional Sequence to Sequence Model for Human Dynamics, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35] Lihui Wang, et al. Human motion prediction for human-robot collaboration, 2017.

[36] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[37] Michael J. Black, et al. On Human Motion Prediction Using Recurrent Neural Networks, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Marco Cuturi, et al. Soft-DTW: a Differentiable Loss Function for Time-Series, 2017, ICML.

[39] Danica Kragic, et al. Deep Representation Learning for Human Motion Prediction and Classification, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Léon Bottou, et al. Towards Principled Methods for Training Generative Adversarial Networks, 2017, ICLR.

[41] Dumitru Erhan, et al. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Marc'Aurelio Ranzato, et al. Sequence Level Training with Recurrent Neural Networks, 2015, ICLR.

[43] Silvio Savarese, et al. Structural-RNN: Deep Learning on Spatio-Temporal Graphs, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Hema Swetha Koppula, et al. Anticipating Human Activities Using Object Affordances for Reactive Robotic Response, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45] Jitendra Malik, et al. Recurrent Network Models for Human Dynamics, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[47] Cristian Sminchisescu, et al. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48] Konrad Paul Kording, et al. The statistics of natural hand movements, 2008, Experimental Brain Research.