Semantic Object Accuracy for Generative Text-to-Image Synthesis

Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality but not the conformity between the image and its caption. To address these challenges, we introduce a new model that explicitly models individual objects within an image, and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. SOA uses a pre-trained object detector to evaluate whether a generated image contains the objects mentioned in the image caption, e.g., whether an image generated from "a car driving down the street" contains a car. We perform a user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans, whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model individual objects outperform models which only model global image characteristics.
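The core idea behind SOA can be sketched in a few lines: generate images from captions that mention a specific object class, run a pre-trained detector on each generated image, and report the per-class recall of the mentioned object. The snippet below is a minimal illustration of this class-averaged scoring scheme, not the paper's exact protocol; the `detect` callable stands in for a real pre-trained object detector (the paper uses YOLOv3), and the toy `fake_detect` and dictionary-based "images" are purely hypothetical placeholders.

```python
from collections import defaultdict

def semantic_object_accuracy(samples, detect):
    """Class-averaged SOA-style score.

    samples: iterable of (image, mentioned_class) pairs, where each image
             was generated from a caption mentioning `mentioned_class`.
    detect:  callable returning the set of class labels found in an image
             (stand-in for a pre-trained object detector such as YOLOv3).
    """
    hits = defaultdict(int)    # images per class where the object was detected
    totals = defaultdict(int)  # images generated per class
    for image, label in samples:
        totals[label] += 1
        if label in detect(image):
            hits[label] += 1
    per_class = {c: hits[c] / totals[c] for c in totals}
    # Average over classes so frequent classes do not dominate the score.
    mean_score = sum(per_class.values()) / len(per_class)
    return mean_score, per_class

# Toy stand-in detector: the "image" already lists its visible objects.
def fake_detect(image):
    return image["objects"]

samples = [
    ({"objects": {"car"}}, "car"),   # caption mentioned a car; one appears
    ({"objects": set()},  "car"),    # caption mentioned a car; none detected
    ({"objects": {"dog"}}, "dog"),   # caption mentioned a dog; one appears
]
mean_soa, per_class = semantic_object_accuracy(samples, fake_detect)
# per_class: car -> 0.5, dog -> 1.0; mean -> 0.75
```

Averaging over classes rather than over images is one plausible design choice (the paper distinguishes image-averaged and class-averaged variants); a real evaluation would also need many captions per class and a detector confidence threshold.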
