Object-Centric Image Generation from Layouts

Despite recent impressive results on single-object and single-domain image generation, the generation of complex scenes with multiple objects remains challenging. In this paper, we start with the idea that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object instance-awareness. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of evaluation metrics used in previous works by introducing SceneFID -- an object-centric adaptation of the popular Fr{e}chet Inception Distance metric, that is better suited for multi-object images.

[1]  Yuichi Yoshida,et al.  Spectral Normalization for Generative Adversarial Networks , 2018, ICLR.

[2]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3]  Larry P. Heck,et al.  Learning deep structured semantic models for web search using clickthrough data , 2013, CIKM.

[4]  Jianfei Cai,et al.  Scene Graph Generation With External Knowledge and Image Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Christoph Goller,et al.  Learning task-dependent distributed representations by backpropagation through structure , 1996, Proceedings of International Conference on Neural Networks (ICNN'96).

[6]  Hugo Larochelle,et al.  Modulating early visual processing by language , 2017, NIPS.

[7]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[8]  Kilian Q. Weinberger,et al.  An empirical study on evaluation metrics of generative adversarial networks , 2018, ArXiv.

[9]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[10]  Yoshua Bengio,et al.  Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Wei Sun,et al.  Image Synthesis From Reconfigurable Layout and Style , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[14]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[17]  Vladlen Koltun,et al.  Semi-Parametric Image Synthesis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Rishi Sharma,et al.  A Note on the Inception Score , 2018, ArXiv.

[20]  Anton van den Hengel,et al.  Graph-Structured Representations for Visual Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Taesung Park,et al.  Semantic Image Synthesis With Spatially-Adaptive Normalization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[23]  Thomas Brox,et al.  Generating Images with Perceptual Similarity Metrics based on Deep Networks , 2016, NIPS.

[24]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[25]  Leon A. Gatys,et al.  Image Style Transfer Using Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Vittorio Ferrari,et al.  COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  R. Devon Hjelm,et al.  Locality and compositionality in zero-shot learning , 2019, ICLR.

[28]  Takeru Miyato,et al.  cGANs with Projection Discriminator , 2018, ICLR.

[29]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[31]  Luc Van Gool,et al.  Exemplar Guided Unsupervised Image-to-Image Translation with Semantic Consistency , 2018, ICLR.

[32]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[33]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[34]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Melanie Mitchell,et al.  Semantic Image Retrieval via Active Grounding of Visual Situations , 2018, 2018 IEEE 12th International Conference on Semantic Computing (ICSC).

[36]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  He Ma,et al.  Quantitatively Evaluating GANs With Divergences Proposed for Training , 2018, ICLR.

[38]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[39]  F. Scarselli,et al.  A new model for learning in graph domains , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[40]  Xiaogang Wang,et al.  PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph , 2019, NeurIPS.

[41]  Nenghai Yu,et al.  Semantics Disentangling for Text-To-Image Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Li Fei-Fei,et al.  Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Li Fei-Fei,et al.  Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.

[45]  Andrea Vedaldi,et al.  Instance Normalization: The Missing Ingredient for Fast Stylization , 2016, ArXiv.

[46]  Wei Sun,et al.  Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis , 2020, ArXiv.

[47]  Gaurav Mittal,et al.  Interactive Image Generation Using Scene Graphs , 2019, DGS@ICLR.

[48]  Lei Zhang,et al.  Object-Driven Text-To-Image Synthesis via Adversarial Training , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[50]  Yong Jae Lee,et al.  FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Bo Zhao,et al.  Image Generation From Layout , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[53]  Jae Hyun Lim,et al.  Geometric GAN , 2017, ArXiv.

[54]  Jia Deng,et al.  Pixels to Graphs by Associative Embedding , 2017, NIPS.

[55]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[57]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.

[58]  Basura Fernando,et al.  SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.

[59]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[60]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[61]  Yu Cheng,et al.  StoryGAN: A Sequential Conditional GAN for Story Visualization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Yoshua Bengio,et al.  ChatPainter: Improving Text to Image Generation using Dialogue , 2018, ICLR.

[63]  R Devon Hjelm,et al.  Zero-Shot Learning from scratch (ZFS): leveraging local compositional representations , 2020, ArXiv.

[64]  Jiwen Lu,et al.  An Improved Evaluation Framework for Generative Adversarial Networks , 2018, ArXiv.

[65]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[66]  Mario Lucic,et al.  Are GANs Created Equal? A Large-Scale Study , 2017, NeurIPS.

[67]  Lior Wolf,et al.  Specifying Object Attributes and Relations in Interactive Scene Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).