Single-Shot Freestyle Dance Reenactment

Motion transfer between a source dancer and a target person is a special case of the pose transfer problem, in which the target person changes pose to follow the dancer's motions. In this work, we propose a novel method that reanimates a single image using arbitrary video sequences that were unseen during training. The method combines three networks: (i) a segmentation-mapping network, (ii) a realistic frame-rendering network, and (iii) a face refinement network. By separating the task into these three stages, we are able to synthesize a novel sequence of realistic frames that captures natural motion and appearance. Our method obtains significantly better visual quality than previous methods and is able to animate diverse body types and appearances captured in challenging poses, as shown in the experiments and supplementary video.
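
The three-stage decomposition described above can be illustrated with a minimal sketch. It only shows how three stages might be chained frame by frame. The function name `reenact`, the stage callables `map_segmentation`, `render_frame`, and `refine_face`, and the inputs each stage receives are hypothetical placeholders chosen for illustration, not the interfaces of the actual networks.

```python
from typing import Any, Callable, Iterable, List


def reenact(
    target_image: Any,
    driving_frames: Iterable[Any],
    map_segmentation: Callable[[Any, Any], Any],
    render_frame: Callable[[Any, Any], Any],
    refine_face: Callable[[Any, Any], Any],
) -> List[Any]:
    """Chain three stages over a driving video (illustrative only).

    Assumption: every stage may look at the single target image; the
    exact inputs of the real networks are not specified in the abstract.
    """
    outputs = []
    for driving in driving_frames:
        # Stage (i): map the driving frame to a segmentation for the target.
        segmentation = map_segmentation(target_image, driving)
        # Stage (ii): render a realistic frame from that segmentation.
        frame = render_frame(target_image, segmentation)
        # Stage (iii): refine the face region of the rendered frame.
        outputs.append(refine_face(target_image, frame))
    return outputs
```

Passing simple stub callables (for example, functions that build strings) is enough to check that the stages are applied in order, once per driving frame.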
