3D Noise and Adversarial Supervision Is All You Need for Multi-modal Semantic Image Synthesis

Semantic image synthesis models suffer from training instabilities and poor image quality when trained with adversarial supervision alone. Historically, this was alleviated via an additional VGG-based perceptual loss. Hence, we propose a new simplified GAN model, which needs only adversarial supervision to achieve high-quality results. In doing so, we also show that the VGG supervision decreases image diversity and can hurt image quality. We achieve the improvement by redesigning the discriminator as a semantic segmentation network. The resulting stronger supervision makes the VGG loss obsolete. Moreover, in contrast to previous work, we enable high-quality multi-modal image synthesis through a novel noise sampling scheme. Compared to the state of the art, we achieve an average improvement of 6 FID and 7 mIoU.

[1]  Stefan Winkler,et al.  The Unusual Effectiveness of Averaging in GAN Training , 2018, ICLR.

[2]  Zhou Wang,et al.  Multiscale structural similarity for image quality assessment , 2003, The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.

[3]  Taesung Park,et al.  Semantic Image Synthesis With Spatially-Adaptive Normalization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[5]  Vittorio Ferrari,et al.  COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Xiaogang Wang,et al.  Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis , 2019, NeurIPS.

[7]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[8]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[11]  Bolei Zhou,et al.  Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.