Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation

Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this article, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. This framework first estimates global flow fields between sources and targets. Then, corresponding local source feature patches are sampled with content-aware local attention coefficients. We show that our framework can spatially transform the inputs in an efficient manner. Meanwhile, we further model the temporal consistency for the person image animation task to generate coherent videos. The experiment results of both image generation and animation tasks demonstrate the superiority of our model. Besides, additional results of novel view synthesis and face image animation show that our model is applicable to other tasks requiring spatial transformation. The source code of our project is available at https://github.com/RenYurui/Global-Flow-Local-Attention.

[1]  Dani Lischinski,et al.  Deep Video‐Based Performance Cloning , 2018, Comput. Graph. Forum.

[2]  Jitendra Malik,et al.  View Synthesis by Appearance Flow , 2016, ECCV.

[3]  R. Stephenson A and V , 1962, The British journal of ophthalmology.

[4]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[5]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Aaas News,et al.  Book Reviews , 1893, Buffalo Medical and Surgical Journal.

[7]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Thomas H. Li,et al.  StructureFlow: Image Inpainting via Structure-Aware Appearance Flow , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[10]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Weilin Huang,et al.  ClothFlow: A Flow-Based Model for Clothed Person Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Nicu Sebe,et al.  First Order Motion Model for Image Animation , 2020, NeurIPS.

[13]  Yuichi Yoshida,et al.  Spectral Normalization for Generative Adversarial Networks , 2018, ICLR.

[14]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Andrea Vedaldi,et al.  Instance Normalization: The Missing Ingredient for Fast Stylization , 2016, ArXiv.

[16]  Chen Gao,et al.  PSGAN: Pose and Expression Robust Spatial-Aware GAN for Customizable Makeup Transfer , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Neil Genzlinger A. and Q , 2006 .

[18]  Nicu Sebe,et al.  Animating Arbitrary Objects via Deep Motion Transfer , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Serge J. Belongie,et al.  Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[21]  Miao Yu,et al.  Progressive Pose Attention Transfer for Person Image Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Xiaogang Wang,et al.  DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Jan Kautz,et al.  Few-shot Video-to-Video Synthesis , 2019, NeurIPS.

[24]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[25]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[28]  Weiwei Sun,et al.  Linearized Multi-Sampling for Differentiable Image Transformation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Kevin Barraclough,et al.  I and i , 2001, BMJ : British Medical Journal.

[30]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[31]  Henryk Sienkiewicz,et al.  Quo Vadis? , 1967, American Association of Industrial Nurses journal.

[32]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[33]  Tao Mei,et al.  Unsupervised Person Image Generation With Semantic Parsing Transformation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Hao Zhu,et al.  CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Thomas H. Li,et al.  Deep Image Spatial Transformation for Person Image Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Тараса Шевченка,et al.  Quo vadis? , 2013, Clinical chemistry.

[38]  Jan Kautz,et al.  Video-to-Video Synthesis , 2018, NeurIPS.

[39]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[40]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[41]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  W. Marsden I and J , 2012 .

[43]  Lu Yuan,et al.  Cross-Domain Correspondence Learning for Exemplar-Based Image Translation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[45]  Michael J. Black,et al.  Optical Flow Estimation Using a Spatial Pyramid Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Björn Ommer,et al.  A Variational U-Net for Conditional Appearance and Shape Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[49]  Andreas Rössler,et al.  FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces , 2018, ArXiv.

[50]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[51]  Leonid Sigal,et al.  DwNet: Dense warp-based network for pose-guided human video generation , 2019, BMVC.

[52]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[53]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[54]  Alexandros G. Dimakis,et al.  Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[56]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Thomas S. Huang,et al.  Generative Image Inpainting with Contextual Attention , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Stephen Lin,et al.  Local Relation Networks for Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[59]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Wenhan Luo,et al.  Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[63]  Chen Huang,et al.  Dense Intrinsic Appearance Flow for Human Pose Transfer , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Nicu Sebe,et al.  Deformable GANs for Pose-Based Human Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.