Everybody Dance Now

This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We approach this problem as video-to-video translation using pose as an intermediate representation. To transfer the motion, we extract poses from the source subject and apply the learned pose-to-appearance mapping to generate the target subject. We predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for realistic face synthesis. Although our method is quite simple, it produces surprisingly compelling results (see video). This motivates us to also provide a forensics tool for reliable synthetic content detection, which is able to distinguish videos synthesized by our system from real data. In addition, we release a first-of-its-kind open-source dataset of videos that can be legally used for training and motion transfer.
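Below is a minimal sketch, not the authors' released implementation, of the pose-to-appearance mapping the abstract describes: a conditional generator takes a rendered pose figure for frame t together with the previously generated frame and outputs the target-subject image, so that two consecutive frames can be produced per step for temporal coherence. The tiny convolutional architecture, tensor sizes, and the random tensors standing in for rendered pose maps are all illustrative assumptions (the actual system builds on much larger image-translation generators and a separate face pipeline).

```python
# Sketch of a pose-to-appearance generator with temporal conditioning.
# NOT the paper's code: layer counts and channel widths are placeholder assumptions.
import torch
import torch.nn as nn

class PoseToAppearanceGenerator(nn.Module):
    """Maps a pose map, conditioned on the previously generated frame, to a target-subject frame."""
    def __init__(self, pose_channels=3, image_channels=3, width=64):
        super().__init__()
        in_ch = pose_channels + image_channels  # pose for frame t + generated frame t-1
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, image_channels, kernel_size=7, padding=3), nn.Tanh(),
        )

    def forward(self, pose_map, prev_frame):
        return self.net(torch.cat([pose_map, prev_frame], dim=1))

# Two consecutive frames are generated per step so a discriminator can judge the pair
# and encourage temporally coherent video, as described above.
G = PoseToAppearanceGenerator()
pose_t = torch.randn(1, 3, 256, 256)    # stand-in for a rendered pose figure at time t
pose_t1 = torch.randn(1, 3, 256, 256)   # stand-in for the pose figure at time t+1
prev = torch.zeros(1, 3, 256, 256)      # zero image used before the first frame
frame_t = G(pose_t, prev)               # generated frame at time t
frame_t1 = G(pose_t1, frame_t)          # frame at time t+1, conditioned on frame t
```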
