Paper information - 3D Noise and Adversarial Supervision Is All You Need for Multi-modal Semantic Image Synthesis

Semantic image synthesis models suffer from training instability and poor image quality when trained with adversarial supervision alone. Historically, this was alleviated via an additional VGG-based perceptual loss. In this work, we propose a new simplified GAN model, which needs only adversarial supervision to achieve high-quality results. In doing so, we also show that the VGG supervision decreases image diversity and can hurt image quality. We achieve the improvement by redesigning the discriminator as a semantic segmentation network. The resulting stronger supervision makes the VGG loss obsolete. Moreover, in contrast to previous work, we enable high-quality multi-modal image synthesis through a novel noise sampling scheme. Compared to the state of the art, we achieve an average improvement of 6 FID points and 7 mIoU points.
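The two ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes that the "3D noise" is a tensor with one Gaussian value per channel and pixel, concatenated with a one-hot label map along the channel axis. It also assumes that the segmentation-style discriminator is trained with a per-pixel cross-entropy over N real classes plus one extra "fake" class. The abstract states neither detail, and all function names below are hypothetical.

```python
import math
import random


def one_hot_label_map(labels, num_classes):
    """Turn an H x W grid of class ids into a C x H x W one-hot tensor (nested lists)."""
    return [[[1.0 if cls == c else 0.0 for cls in row] for row in labels]
            for c in range(num_classes)]


def sample_3d_noise(channels, height, width, rng):
    """Sample a channels x H x W noise tensor: one Gaussian value per channel and pixel.

    Assumption: per-pixel, per-channel sampling is what "3D noise" refers to.
    """
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


def generator_input(labels, num_classes, noise_channels, rng):
    """Concatenate noise planes and one-hot label planes along the channel axis."""
    height, width = len(labels), len(labels[0])
    noise = sample_3d_noise(noise_channels, height, width, rng)
    return noise + one_hot_label_map(labels, num_classes)


def segmentation_discriminator_loss(logits, target):
    """Mean per-pixel cross-entropy over N + 1 classes.

    logits: (N + 1) x H x W scores produced by a segmentation-style discriminator.
    target: H x W class ids; id N marks the extra "fake" class (assumption).
    """
    height, width = len(target), len(target[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            scores = [plane[i][j] for plane in logits]
            top = max(scores)
            # Numerically stable log-sum-exp over the class scores of this pixel.
            log_z = top + math.log(sum(math.exp(s - top) for s in scores))
            total += log_z - scores[target[i][j]]
    return total / (height * width)
```

Under these assumptions, a real image would use its own label map as the per-pixel target, while a generated image would use a target filled with the extra class id N. Resampling the noise for the same label map would then give the generator a different input each time, which is one plausible reading of how multi-modal synthesis is enabled.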