Semantic Object Accuracy for Generative Text-to-Image Synthesis

Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics judge only image quality and not the conformity between an image and its caption. To address these challenges, we introduce a new model that explicitly models individual objects within an image, and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. SOA uses a pre-trained object detector to evaluate whether a generated image contains the objects mentioned in the image caption, e.g., whether an image generated from "a car driving down the street" contains a car. We perform a user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans, whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model objects outperform models which only model global image characteristics.
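
To make the idea concrete, the following Python sketch shows one way a recall-style object accuracy could be computed: for each caption, a pre-trained detector (e.g., YOLOv3) is run on the corresponding generated image, and we check whether the object classes mentioned in the caption were found. This is an illustrative sketch only, not the paper's implementation; the names `soa_score`, `run_detector`, and the `caption_labels` field are hypothetical placeholders.

```python
# Illustrative sketch of an SOA-style evaluation (not the authors' code).
# `run_detector` stands in for any pre-trained object detector, e.g. YOLOv3;
# its signature here is an assumption for the sake of the example.

from typing import Callable, Dict, List


def soa_score(
    samples: List[Dict],                           # each: {"caption_labels": [...], "image": ...}
    run_detector: Callable[[object], List[str]],   # image -> list of detected class labels
) -> float:
    """Fraction of caption-mentioned object classes that the detector
    actually finds in the corresponding generated image (recall-style)."""
    hits, total = 0, 0
    for sample in samples:
        detected = set(run_detector(sample["image"]))
        # e.g. caption_labels == ["car"] for "a car driving down the street"
        for label in sample["caption_labels"]:
            total += 1
            if label in detected:
                hits += 1
    return hits / max(total, 1)
```

Note that this sketch averages over all caption-object pairs; aggregating per object class instead (so that rare classes count as much as frequent ones) is an equally plausible design choice, and which aggregation is appropriate depends on whether the evaluation should be dominated by frequent classes.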
