Background and foreground disentangled generative adversarial network for scene image synthesis

Abstract Although recent generative models have made remarkable progress in adversarial image synthesis, generating high-fidelity images that contain diverse entities and complex scene layouts from structured descriptions remains a pivotal and challenging problem. To this end, we present a Background and Foreground Disentangled Generative Adversarial Network (BFD-GAN) to synthesize high-quality images from scene graphs. First, our method uses a graph convolutional network to infer a semantic background from the input scene graph. Then, a foreground parsing module, which encourages unsupervised generation, is proposed to infer semantically related foregrounds with fine-grained geometric properties. Furthermore, a foreground-background integrating module is employed for the final image generation, in which a foreground-relation-aware attention mechanism refines and fuses the inferred foregrounds into the background. Evaluated on the COCO-Stuff and Visual Genome datasets, we benchmark our model against existing methods and show that BFD-GAN is more capable of generating complex backgrounds and the corresponding sharp foregrounds for given scene structures.
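The abstract describes a three-stage pipeline: graph-convolutional background inference, foreground parsing, and attention-based foreground-background fusion. The following is a minimal sketch of that data flow, assuming PyTorch; the module names (`GraphConvLayer`, `BFDGANSketch`), tensor shapes, and layer sizes are illustrative assumptions and not the paper's actual architecture.

```python
# Hypothetical sketch of the BFD-GAN pipeline outlined in the abstract.
# All layer sizes and aggregation choices are assumptions for illustration.
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    """One graph-convolution step over (subject, predicate, object) triples."""

    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(3 * dim, dim)

    def forward(self, obj_vecs, pred_vecs, edges):
        # edges: (num_triples, 2) holding subject/object indices into obj_vecs.
        s = obj_vecs[edges[:, 0]]
        o = obj_vecs[edges[:, 1]]
        msg = self.update(torch.cat([s, pred_vecs, o], dim=-1))
        # Simplified aggregation: add each triple's message back to its subject node.
        out = obj_vecs.clone()
        out.index_add_(0, edges[:, 0], msg)
        return out


class BFDGANSketch(nn.Module):
    """Background inference -> foreground parsing -> attention-based fusion."""

    def __init__(self, dim=128, img_size=64):
        super().__init__()
        self.gcn = GraphConvLayer(dim)
        # Background branch: pooled graph embedding -> coarse semantic background.
        self.background_net = nn.Sequential(
            nn.Linear(dim, 3 * img_size * img_size), nn.Tanh()
        )
        # Foreground parsing: per-object embedding -> one foreground map per object.
        self.foreground_net = nn.Sequential(
            nn.Linear(dim, 3 * img_size * img_size), nn.Tanh()
        )
        # Stand-in for foreground-relation-aware attention: one weight per object.
        self.attn = nn.Linear(dim, 1)
        self.img_size = img_size

    def forward(self, obj_vecs, pred_vecs, edges):
        n = obj_vecs.size(0)
        h = self.gcn(obj_vecs, pred_vecs, edges)
        background = self.background_net(h.mean(dim=0)).view(
            3, self.img_size, self.img_size
        )
        foregrounds = self.foreground_net(h).view(n, 3, self.img_size, self.img_size)
        weights = torch.softmax(self.attn(h), dim=0).view(n, 1, 1, 1)
        # Fuse the attended foregrounds into the inferred background.
        return background + (weights * foregrounds).sum(dim=0)


if __name__ == "__main__":
    dim = 128
    obj_vecs = torch.randn(4, dim)   # 4 objects in the scene graph
    pred_vecs = torch.randn(3, dim)  # 3 relations
    edges = torch.tensor([[0, 1], [1, 2], [2, 3]])
    img = BFDGANSketch(dim)(obj_vecs, pred_vecs, edges)
    print(img.shape)  # torch.Size([3, 64, 64])
```

In the actual model the background and foreground branches would be convolutional generators and the fusion would use the paper's foreground-relation-aware attention; this sketch only mirrors the stage ordering stated in the abstract.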
