Human MotionFormer: Transferring Human Motions with Vision Transformers

Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. An accurate matching between the source person and the target motion in both large and subtle motion changes is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders to extract input features (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person as Key and Value, calculating the cross-attention maps to conduct a global feature matching. Further, we introduce a convolutional layer to improve the local perception after the global cross-attention computations. This matching process is implemented in both warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss to enable the co-supervision between warping and generation branches for better motion representations. Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively. Project page: \url{https://github.com/KumapowerLIU/Human-MotionFormer}

[1]  Qifeng Chen,et al.  One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations , 2022, NeurIPS.

[2]  Jiangliu Wang,et al.  AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition , 2022, NeurIPS.

[3]  Yuning Jiang,et al.  Structure-Aware Motion Transfer with Deformable Anchor Model , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Hui Zhang,et al.  Thin-Plate Spline Motion Model for Image Animation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  P. Xie,et al.  Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations , 2022, ArXiv.

[6]  Li Dong,et al.  Swin Transformer V2: Scaling Up Capacity and Resolution , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Victor Lempitsky,et al.  Resolution-robust Large Mask Inpainting with Fourier Convolutions , 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[8]  Zhuowen Tu,et al.  ViTGAN: Training GANs with Vision Transformers , 2021, ICLR.

[9]  Nenghai Yu,et al.  CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jianmin Bao,et al.  Uformer: A General U-Shaped Transformer for Image Restoration , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Qi Tian,et al.  Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation , 2021, ECCV Workshops.

[12]  Ping Luo,et al.  Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning , 2021, ArXiv.

[13]  Chen Change Loy,et al.  3D Human Texture Estimation from a Single Image with Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Luc Van Gool,et al.  SwinIR: Image Restoration Using Swin Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[15]  Thomas H. Li,et al.  Combining Attention with Flow for Person Image Synthesis , 2021, ACM Multimedia.

[16]  Zilong Huang,et al.  Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer , 2021, ArXiv.

[17]  Yanwei Fu,et al.  The Image Local Autoregressive Transformer , 2021, NeurIPS.

[18]  Yibing Song,et al.  PD-GAN: Probabilistic Diverse GAN for Image Inpainting , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Sergey Tulyakov,et al.  Motion Representations for Articulated Animation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  N. Codella,et al.  CvT: Introducing Convolutions to Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Quanfu Fan,et al.  CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Xintong Han,et al.  Few-Shot Human Motion Transfer by Personalized Geometry and Texture Modeling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Jing Liao,et al.  High-Fidelity Pluralistic Image Completion with Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Yibing Song,et al.  DeFLOCNet: Deep Image Editing via Flexible Low-level Controls , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Dong Liu,et al.  Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Xiang Li,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Tao Xiang,et al.  Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[29]  B. Ommer,et al.  Taming Transformers for High-Resolution Image Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Lior Wolf,et al.  Single-Shot Freestyle Dance Reenactment , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Wen Gao,et al.  Pre-Trained Image Processing Transformer , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[33]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Shiyu Chang,et al.  TransGAN: Two Transformers Can Make One Strong GAN , 2021, ArXiv.

[35]  Wei Huang,et al.  Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations , 2020, ECCV.

[36]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[37]  Thomas H. Li,et al.  Deep Image Spatial Transformation for Person Image Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Nicu Sebe,et al.  First Order Motion Model for Image Animation , 2020, NeurIPS.

[39]  Jan Kautz,et al.  Few-shot Video-to-Video Synthesis , 2019, NeurIPS.

[40]  Weilin Huang,et al.  ClothFlow: A Flow-Based Model for Clothed Person Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Wenhan Luo,et al.  Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Victor Lempitsky,et al.  Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Bin Jiang,et al.  Coherent Semantic Attention for Image Inpainting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[44]  Victor Lempitsky,et al.  Textured Neural Avatars , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  V. Lempitsky,et al.  Few-Shot Adversarial Learning of Realistic Neural Talking Head Models , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Nicu Sebe,et al.  Animating Arbitrary Objects via Deep Motion Transfer , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[48]  Hanjiang Lai,et al.  Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis , 2018, NeurIPS.

[49]  Iasonas Kokkinos,et al.  Dense Pose Transfer , 2018, ECCV.

[50]  Jan Kautz,et al.  Video-to-Video Synthesis , 2018, NeurIPS.

[51]  Xiaoou Tang,et al.  LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Frédo Durand,et al.  Synthesizing Images of Humans in Unseen Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Björn Ommer,et al.  A Variational U-Net for Conditional Appearance and Shape Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55]  Haoyu Wang,et al.  Pose Flow: Efficient Online Pose Tracking , 2018, BMVC.

[56]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Luc Van Gool,et al.  Disentangled Person Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[61]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[62]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[63]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[66]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[67]  Leon A. Gatys,et al.  A Neural Algorithm of Artistic Style , 2015, ArXiv.

[68]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[69]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[70]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.