Person Image Synthesis in Arbitrary 3D Poses Based on Part Affinity Fields

We consider the person image synthesis problem, in which an output image is generated from an arbitrary source image and an arbitrary target 3D pose. Prior person image synthesis methods usually use 2D keypoint heatmaps to represent the target pose. However, this 2D representation can be ambiguous under self-occlusion due to the lack of depth information, leading to inappropriate generated images. To solve this problem, we propose to synthesize person images from 3D poses. We introduce an improved part affinity field representation that describes both the 3D configuration of the target pose and the 2D location of the person in pixel space. Compared to using 2D poses, our synthesized images show visually better details, such as correct self-occlusion, brightness, and face direction. Moreover, in contrast to prior person image generators, our method predicts the difference between the source image and the target image instead of predicting the output image directly. This strategy allows the network to generate a better foreground and significantly reduces noise in the image background. We evaluate our method on images of fifteen different actions from the Human3.6M dataset. Extensive experiments demonstrate that our method synthesizes much better person images than 2D pose-based methods. Given a sequence of desired poses, our method produces a sequence of temporally smooth and coherent images, even for another subject and other actions, which indicates that our method has great potential for generating high-quality person videos.
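To make the pose representation concrete, below is a minimal sketch of one plausible way to rasterize a 3D-aware part affinity field: pixels near each limb's 2D projection store the limb's unit 3D direction vector, so depth ordering is preserved where a purely 2D field would be ambiguous. The channel layout, the `radius` parameter, and the function name are illustrative assumptions, not the paper's exact encoding.

```python
import numpy as np

def paf_3d(joints_2d, joints_3d, limbs, h, w, radius=4.0):
    """Rasterize a 3-channels-per-limb affinity field (hypothetical layout):
    pixels within `radius` of a limb's 2D projection store that limb's
    unit 3D direction vector.

    joints_2d: (J, 2) pixel coordinates of the projected joints.
    joints_3d: (J, 3) camera-space joint coordinates.
    limbs:     list of (parent, child) joint index pairs.
    """
    field = np.zeros((len(limbs) * 3, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (a, b) in enumerate(limbs):
        p, q = joints_2d[a], joints_2d[b]
        seg = q - p
        length = np.linalg.norm(seg) + 1e-8
        u = seg / length
        # Distance of each pixel along and across the projected 2D segment.
        dx, dy = xs - p[0], ys - p[1]
        along = dx * u[0] + dy * u[1]
        across = np.abs(dx * u[1] - dy * u[0])
        mask = (along >= 0) & (along <= length) & (across <= radius)
        # The unit 3D direction carries the depth cue a 2D PAF lacks.
        d3 = joints_3d[b] - joints_3d[a]
        d3 = d3 / (np.linalg.norm(d3) + 1e-8)
        for c in range(3):
            field[3 * i + c][mask] = d3[c]
    return field
```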
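The difference-prediction strategy can likewise be sketched in a few lines: the generator outputs a residual that is added back to the source image, so unchanged background pixels cost the network nothing and noise there is suppressed. This is a minimal sketch assuming images normalized to [-1, 1]; `backbone` stands in for an arbitrary image-to-image network and is not the paper's specific architecture.

```python
import torch
import torch.nn as nn

class ResidualGenerator(nn.Module):
    """Predict the target image as source + learned difference."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # any image-to-image network (assumption)

    def forward(self, source_img, pose_field):
        # Condition on the source appearance and the target pose field.
        x = torch.cat([source_img, pose_field], dim=1)
        diff = self.backbone(x)  # predicted (target - source) residual
        # Adding the residual leaves static background untouched by default.
        return (source_img + diff).clamp(-1.0, 1.0)
```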