Embedding Arithmetic for Text-driven Image Transformation

Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations have not been demonstrated for image representations. Recent works aiming at bridging this semantic gap embed images and text into a multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of text-driven image transformation. SIMAT contains 6k images and 18k "transformation queries" that aim at either replacing scene elements or changing their pairwise relationships. The goal is to retrieve an image consistent with the (source image, transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be publicly available. We use SIMAT to show that vanilla CLIP multimodal embeddings are not well suited for text-driven image transformation, but that simple finetuning on the COCO dataset can bring dramatic improvements. We also study whether it is beneficial to leverage the geometric properties of pretrained universal sentence encoders (FastText, LASER, and LaBSE).
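
To make the task concrete, below is a minimal sketch of the embedding-arithmetic idea: encode the source image, subtract the embedding of the source text, add the embedding of the target text, and retrieve the nearest image in a gallery. It assumes OpenAI's `clip` package; the `ViT-B/32` backbone, the scaling factor `lam`, and the gallery construction are illustrative assumptions, not the paper's exact protocol (which additionally finetunes the embeddings on COCO).

```python
# A minimal sketch of text-driven embedding arithmetic, assuming OpenAI's
# `clip` package (pip install git+https://github.com/openai/CLIP).
# The scaling factor `lam` and the gallery construction are illustrative
# choices, not the exact protocol of the paper.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def l2_normalize(x: torch.Tensor) -> torch.Tensor:
    # Project embeddings onto the unit sphere so that dot products
    # are cosine similarities.
    return x / x.norm(dim=-1, keepdim=True)

@torch.no_grad()
def transform_query(image_path: str, source_text: str, target_text: str,
                    lam: float = 1.0) -> torch.Tensor:
    # Embedding arithmetic: query = image + lam * (target_text - source_text).
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_emb = l2_normalize(model.encode_image(image).float())
    tokens = clip.tokenize([source_text, target_text]).to(device)
    src_emb, tgt_emb = l2_normalize(model.encode_text(tokens).float())
    return l2_normalize(img_emb + lam * (tgt_emb - src_emb))

@torch.no_grad()
def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    # Rank a gallery of L2-normalized image embeddings by cosine similarity
    # to the transformed query; the top-ranked image is the prediction.
    scores = gallery_embs @ query_emb.squeeze(0)
    return scores.argsort(descending=True)
```

For a query such as `transform_query("dog_on_grass.jpg", "a dog", "a cat")`, the top-ranked gallery image should ideally show a cat in an otherwise similar scene; SIMAT scores such retrievals with an OSCAR-based image/text matching oracle.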

[1] Martin Engilberge, et al. Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization. CVPR, 2018.

[2] Martin Kleinsteuber, et al. Compositional Learning of Image-Text Query for Image Retrieval. WACV, 2021.

[3] James Hays, et al. Let's Transfer Transformations of Shared Semantic Representations. arXiv, 2019.

[4] Thomas Lukasiewicz, et al. ManiGAN: Text-Guided Image Manipulation. CVPR, 2020.

[5] Li Fei-Fei, et al. Composing Text and Image for Image Retrieval - an Empirical Odyssey. CVPR, 2019.

[6] Yale Song, et al. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. CVPR, 2019.

[7] Seonghyeon Nam, et al. Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language. NeurIPS, 2018.

[8] Edward H. Adelson, et al. Discovering states and transformations in image collections. CVPR, 2015.

[9] Larry S. Davis, et al. Automatic Spatially-Aware Fashion Concept Discovery. ICCV, 2017.

[10] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[11] Daniel Cohen-Or, et al. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. ICCV, 2021.

[12] Ulf Assarsson, et al. A Benchmark for , 2001.

[13] Ewan Dunbar, et al. Analogies minus analogy test: measuring regularities in word embeddings. CoNLL, 2020.

[14] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV, 2020.

[15] Naveen Arivazhagan, et al. Language-agnostic BERT Sentence Embedding. arXiv, 2020.

[16] Zhedong Zheng, et al. Dual-path Convolutional Image-Text Embeddings with Instance Loss. ACM Trans. Multim. Comput. Commun. Appl., 2017.

[17] Jiebo Luo, et al. Semantic Layout Manipulation with High-Resolution Sparse Attention. arXiv, 2020.

[18] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC, 2017.

[19] Holger Schwenk, et al. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. TACL, 2018.

[20] Yin Li, et al. Learning Deep Structure-Preserving Image-Text Embeddings. CVPR, 2016.

[21] Franck Dernoncourt, et al. A Benchmark and Baseline for Language-Driven Image Editing. ACCV, 2020.

[22] Baoyuan Wu, et al. TediGAN: Text-Guided Diverse Image Generation and Manipulation. arXiv, 2020.

[23] Honglak Lee, et al. An efficient framework for learning sentence representations. ICLR, 2018.

[24] Pietro Perona, et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.

[25] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 2016.

[26] Kihyuk Sohn, et al. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. NIPS, 2016.

[27] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information. TACL, 2016.

[28] Yupeng Gao, et al. Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback, 2019.

[29] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR, 2017.

[30] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. ICML, 2021.