Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

To achieve disentangled image manipulation, previous works rely heavily on manual annotation, and the available manipulations are limited to the pre-defined set the models were trained for. In this paper, we propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation, which requires no manual annotation and is therefore not limited to fixed manipulations. Our method achieves these goals by deeply exploiting the power of the large-scale pre-trained vision-language model CLIP [41]. Concretely, we first Predict the possibly entangled attributes for a given text command. Then, based on the predicted attributes, we introduce an entanglement loss to Prevent entanglement during training. Finally, we propose a new evaluation metric to Evaluate disentangled image manipulation. We verify the effectiveness of our method on the challenging face editing task. Extensive experiments show that the proposed PPE framework achieves much better quantitative and qualitative results than the latest StyleCLIP [37] baseline. We will release our source code soon. *Work was carried out during an internship at VIS, Baidu.
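Although the abstract only names the three steps, the Predict and Prevent stages lend themselves to a compact illustration. Below is a minimal sketch, assuming a PyTorch setup with OpenAI's CLIP package; the text-similarity heuristic for Predict, the prompt template, and the exact loss form are illustrative assumptions, not the paper's formulation.

# A minimal sketch of the Predict and Prevent steps (illustrative, not the
# paper's exact formulation). Images are assumed to already be
# CLIP-preprocessed tensors of shape [1, 3, 224, 224].
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # CLIP weights stay frozen

@torch.no_grad()
def predict_entangled(command, candidate_attributes, k=3):
    # Predict: rank candidate attributes by CLIP text-embedding similarity
    # to the edit command; the top-k are treated as possibly entangled.
    feats = model.encode_text(
        clip.tokenize([command] + candidate_attributes).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = (feats[:1] @ feats[1:].T).squeeze(0)  # command vs. each attribute
    return [candidate_attributes[i] for i in sims.topk(k).indices.tolist()]

def entanglement_loss(edited_img, original_img, entangled_attributes):
    # Prevent: the edit must not change the image's CLIP similarity to
    # attributes it should leave alone, e.g. the list predicted above.
    tokens = clip.tokenize(
        [f"a face with {a}" for a in entangled_attributes]).to(device)
    text = model.encode_text(tokens)
    text = text / text.norm(dim=-1, keepdim=True)
    imgs = model.encode_image(torch.cat([edited_img, original_img]))
    imgs = imgs / imgs.norm(dim=-1, keepdim=True)
    edited_sim, original_sim = (imgs @ text.T).chunk(2)
    # Penalize any drift in similarity to the protected attributes.
    return (edited_sim - original_sim).abs().mean()

During training, such a term would be added to the usual CLIP-guided editing objective, so that gradients move the target attribute toward the text command while holding the predicted entangled attributes fixed; CLIP itself is kept frozen throughout.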

[1] Daniel Cohen-Or et al. StyleGAN-NADA. ACM Trans. Graph., 2021.

[2] Nicu Sebe et al. Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation. CVPR, 2021.

[3] Peter Wonka et al. StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows. arXiv, 2020.

[4] Binxu Wang et al. A Geometric Analysis of Deep Generative Image Models and Its Applications. ICLR, 2021.

[5] Daniel Cohen-Or et al. Face Identity Disentanglement via Latent Space Mapping. ACM Trans. Graph., 2020.

[6] Thomas Lukasiewicz et al. ManiGAN: Text-Guided Image Manipulation. CVPR, 2020.

[7] Liujuan Cao et al. Image-to-image Translation via Hierarchical Style Disentanglement. CVPR, 2021.

[8] Dani Lischinski et al. StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation. CVPR, 2021.

[9] Thomas Lukasiewicz et al. Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation. NeurIPS, 2020.

[10] Timo Aila et al. A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR, 2019.

[11] Artem Babenko et al. Unsupervised Discovery of Interpretable Directions in the GAN Latent Space. ICML, 2020.

[12] Xiaogang Wang et al. Deep Learning Face Attributes in the Wild. ICCV, 2015.

[13] Yann Gousseau et al. A Latent Transformer for Disentangled Face Editing in Images and Videos. ICCV, 2021.

[14] Baoyuan Wu et al. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. CVPR, 2021.

[15] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[16] Seonghyeon Nam et al. Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language. NeurIPS, 2018.

[17] Daniel Cohen-Or et al. Designing an Encoder for StyleGAN Image Manipulation. ACM Trans. Graph., 2021.

[18] Pieter Abbeel et al. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS, 2016.

[19] Stefan Lee et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS, 2019.

[20] Jianfeng Gao et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV, 2020.

[21] Lei Zhang et al. VinVL: Making Visual Representations Matter in Vision-Language Models. arXiv, 2021.

[22] Phillip Isola et al. On the "Steerability" of Generative Adversarial Networks. ICLR, 2020.

[23] Mohit Bansal et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP, 2019.

[24] Nicu Sebe et al. Describe What to Change: A Text-guided Unsupervised Image-to-image Translation Approach. ACM Multimedia, 2020.

[25] Deli Zhao et al. In-Domain GAN Inversion for Real Image Editing. ECCV, 2020.

[26] Roger B. Grosse et al. Isolating Sources of Disentanglement in Variational Autoencoders. NeurIPS, 2018.

[27] Daniel Cohen-Or et al. Encoding in Style: A StyleGAN Encoder for Image-to-Image Translation. CVPR, 2021.

[28] Christian Theobalt et al. StyleRig: Rigging StyleGAN for 3D Control Over Portrait Images. CVPR, 2020.

[29] Yedid Hoshen et al. Demystifying Inter-Class Disentanglement. ICLR, 2020.

[30] Bolei Zhou et al. Closed-Form Factorization of Latent Semantics in GANs. CVPR, 2021.

[31] Bolei Zhou et al. Interpreting the Latent Space of GANs for Semantic Face Editing. CVPR, 2020.

[32] Baoyuan Wu et al. Towards Open-World Text-Guided Face Image Generation and Manipulation. arXiv, 2021.

[33] Yike Guo et al. Semantic Image Synthesis via Adversarial Learning. ICCV, 2017.

[34] Yedid Hoshen et al. Scaling-up Disentanglement for Image Translation. ICCV, 2021.

[35] Jaakko Lehtinen et al. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR, 2018.

[36] Furu Wei et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. ICLR, 2020.

[37] Daniel Cohen-Or et al. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. ICCV, 2021.

[38] Andriy Mnih et al. Disentangling by Factorising. ICML, 2018.

[39] Daniel Cohen-Or et al. Pivotal Tuning for Latent-based Editing of Real Images. ACM Trans. Graph., 2021.

[40] Yedid Hoshen et al. An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild. NeurIPS, 2021.

[41] Ilya Sutskever et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.

[42] Jaakko Lehtinen et al. Analyzing and Improving the Image Quality of StyleGAN. CVPR, 2020.

[43] L. B. Soros et al. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders. NeurIPS, 2021.

[44] Ilya Sutskever et al. Zero-Shot Text-to-Image Generation. ICML, 2021.

[45] Anjul Patney et al. Semi-Supervised StyleGAN for Disentanglement Learning. ICML, 2020.

[46] Jaakko Lehtinen et al. GANSpace: Discovering Interpretable GAN Controls. NeurIPS, 2020.