Flow Guided Transformable Bottleneck Networks for Motion Retargeting

Human motion retargeting aims to transfer the motion of one person in a "driving" video or set of images to another person. Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model. However, the scalability of such methods is limited, as each model can only generate videos for the given target subject, and such training videos are labor-intensive to acquire and process. Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention. Methods addressing this task generally use either 2D or explicit 3D representations to transfer motion, and in doing so, sacrifice either accurate geometric modeling or the flexibility of an end-to-end learned representation. Inspired by the Transformable Bottleneck Network, which renders novel views and manipulations of rigid objects, we propose an approach based on an implicit volumetric representation of the image content, which can then be spatially manipulated using volumetric flow fields. We address the challenging question of how to aggregate information across different body poses, learning flow fields that allow for combining content from the appropriate regions of input images of highly non-rigid human subjects performing complex motions into a single implicit volumetric representation. This allows us to learn our 3D representation solely from videos of moving people. Armed with both 3D object understanding and end-to-end learned rendering, this categorically novel representation delivers state-of-the-art image generation quality, as shown by our quantitative and qualitative evaluations.
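As a rough illustration of what spatially manipulating a volumetric representation with a flow field can look like, the sketch below backward-warps a feature volume using a dense 3D flow and then combines several warped volumes into one. It is not the authors' implementation: the function names `warp_volume` and `aggregate_volumes`, the `(C, D, H, W)` layout, the trilinear sampling with clamped borders, and the plain averaging used for aggregation are all assumptions made for this example. In the described approach the flows and the representation are learned end to end, whereas here they are simply given as arrays.

```python
import numpy as np


def warp_volume(features, flow):
    """Backward-warp a feature volume with a dense 3D flow field.

    features: (C, D, H, W) array of per-voxel features (hypothetical layout).
    flow:     (3, D, H, W) array; flow[:, z, y, x] is the (dz, dy, dx) offset,
              in voxels, from an output voxel to the source location it samples.
    Returns a (C, D, H, W) array sampled with trilinear interpolation.
    Sample positions are clamped to the volume bounds.
    """
    _, D, H, W = features.shape
    grid = np.stack(
        np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    ).astype(np.float64)
    maxes = np.array([D - 1, H - 1, W - 1]).reshape(3, 1, 1, 1)

    # Source coordinates for every output voxel, clamped to the volume.
    coords = np.clip(grid + flow, 0, maxes)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, maxes)
    frac = coords - lo

    out = np.zeros(features.shape, dtype=np.float64)
    # Accumulate the eight corner contributions of the trilinear kernel.
    for cz in (0, 1):
        for cy in (0, 1):
            for cx in (0, 1):
                z = hi[0] if cz else lo[0]
                y = hi[1] if cy else lo[1]
                x = hi[2] if cx else lo[2]
                w = (
                    (frac[0] if cz else 1.0 - frac[0])
                    * (frac[1] if cy else 1.0 - frac[1])
                    * (frac[2] if cx else 1.0 - frac[2])
                )
                out += w * features[:, z, y, x]
    return out


def aggregate_volumes(volumes, flows):
    """Warp each input volume into a common pose and average the results.

    Averaging is a placeholder assumption; a learned or weighted
    combination could be substituted without changing the interface.
    """
    warped = [warp_volume(v, f) for v, f in zip(volumes, flows)]
    return np.mean(warped, axis=0)
```

With a zero flow the warp is the identity, and a constant flow of one voxel along the last axis shifts the content by one voxel, which gives a quick sanity check before plugging in predicted flows.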
