Text-to-Image Generation Grounded by Fine-Grained User Attention

Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.

[1]  Seunghoon Hong,et al.  Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[3]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[4]  Radu Soricut,et al.  Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.

[5]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[6]  Rishi Sharma,et al.  A Note on the Inception Score , 2018, ArXiv.

[7]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[9]  Yoshua Bengio,et al.  Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[11]  Lei Zhang,et al.  Object-Driven Text-To-Image Synthesis via Adversarial Training , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ray Kurzweil,et al.  Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax , 2019, IJCAI.

[14]  Richard Sproat,et al.  WordsEye: an automatic text-to-scene conversion system , 2001, SIGGRAPH.

[15]  Jordi Pont-Tuset,et al.  Connecting Vision and Language with Localized Narratives , 2019, ECCV.

[16]  Taesung Park,et al.  Semantic Image Synthesis With Spatially-Adaptive Normalization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jason Baldridge,et al.  Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO , 2020, EACL.

[19]  Ray Kurzweil,et al.  Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model , 2019, RepL4NLP@ACL.

[20]  Vittorio Ferrari,et al.  COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[22]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[23]  Douglas Eck,et al.  A Neural Representation of Sketch Drawings , 2017, ICLR.

[24]  Gabriel Ilharco,et al.  Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.

[25]  Daniel Gillick,et al.  End-to-End Retrieval in Continuous Space , 2018, ArXiv.

[26]  Jason Baldridge,et al.  Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries , 2012, EMNLP.

[27]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Vicente Ordonez,et al.  Text2Scene: Generating Compositional Scenes From Textual Descriptions , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Mauro Di Manzo,et al.  Natural Language Input For Scene Generation , 1983, EACL.

[30]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[31]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[32]  Xiaogang Wang,et al.  PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph , 2019, NeurIPS.

[33]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[34]  Xiaogang Wang,et al.  Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis , 2019, NeurIPS.

[35]  PietraVincent J. Della,et al.  The mathematics of statistical machine translation , 1993 .

[36]  Yoshua Bengio,et al.  ChatPainter: Improving Text to Image Generation using Dialogue , 2018, ICLR.

[37]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[38]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Li Fei-Fei,et al.  Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.