Dual Attention GANs for Semantic Image Synthesis

In this paper, we focus on semantic image synthesis, the task of translating semantic label maps into photo-realistic images. Existing methods lack effective semantic constraints to preserve semantic information and ignore structural correlations in both the spatial and channel dimensions, which leads to blurry and artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) that synthesizes photo-realistic, semantically consistent images with fine details from the input layouts, without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM), to capture semantic structure attention in the spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position through a spatial attention map, so that pixels with the same semantic label are related to each other regardless of their spatial distance. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel through a channel attention map, which integrates associated features among all channel maps regardless of their scales. Finally, we sum the outputs of SAM and CAM to further improve the feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves markedly better results than state-of-the-art methods while using fewer model parameters.
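The two attention paths described above can be illustrated with a minimal NumPy sketch. It is not the DAGAN implementation: it assumes plain dot-product similarity followed by a softmax, in the style of generic self-attention, and it omits the learned projections, the semantic-label inputs, and the scale-wise feature handling that the actual SAM and CAM would involve. The function names `spatial_attention`, `channel_attention`, and `dual_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def spatial_attention(feat):
    """Position-wise attention (assumed form): each pixel attends to all pixels.

    feat has shape (C, H, W). Returns the attended features (C, H, W)
    and the spatial attention map of shape (H*W, H*W).
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)        # (C, N) with N = H*W
    energy = flat.T @ flat               # (N, N) pixel-to-pixel similarity
    attn = softmax(energy, axis=-1)      # row i: weights of pixel i over all pixels
    out = flat @ attn.T                  # out[:, i] = sum_j attn[i, j] * flat[:, j]
    return out.reshape(c, h, w), attn


def channel_attention(feat):
    """Channel-wise attention (assumed form): each channel attends to all channels.

    feat has shape (C, H, W). Returns the attended features (C, H, W)
    and the channel attention map of shape (C, C).
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)        # (C, N)
    energy = flat @ flat.T               # (C, C) channel-to-channel similarity
    attn = softmax(energy, axis=-1)      # row k: weights of channel k over all channels
    out = attn @ flat                    # out[k, :] = sum_l attn[k, l] * flat[l, :]
    return out.reshape(c, h, w), attn


def dual_attention(feat):
    """Sum the outputs of the spatial and channel paths, as the abstract describes."""
    spatial_out, _ = spatial_attention(feat)
    channel_out, _ = channel_attention(feat)
    return spatial_out + channel_out


# Toy feature map: 4 channels on a 3x5 grid (values are arbitrary).
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 3, 5))
fused = dual_attention(feat)
print(fused.shape)  # (4, 3, 5)
```

Because the spatial attention map is computed between all pairs of positions, two pixels with identical feature vectors receive identical attention rows and therefore identical outputs, however far apart they are. This is the sense in which distant pixels of the same semantic class can be related to each other.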
