I2Edit: Towards Multi-turn Interactive Image Editing via Dialogue

Although there have been considerable research efforts on controllable facial image editing, the desirable interactive setting where the users can interact with the system to adjust their requirements dynamically hasn't been well explored. This paper focuses on facial image editing via dialogue and introduces a new benchmark dataset, Multi-turn Interactive Image Editing (I2Edit), for evaluating image editing quality and interaction ability in real-world interactive facial editing scenarios. The dataset is constructed upon the CelebA-HQ dataset with images annotated with a multi-turn dialogue that corresponds to the user editing requirements. I2Edit is challenging, as it needs to 1) track the dynamically updated user requirements and edit the images accordingly, as well as 2) generate the appropriate natural language response to communicate with the user. To address these challenges, we propose a framework consisting of a dialogue module and an image editing module. The former is for user edit requirements tracking and generating the corresponding indicative responses, while the latter edits the images conditioned on the tracked user edit requirements. In contrast to previous works that simply treat multi-turn interaction as a sequence of single-turn interactions, we extract the user edit requirements from the whole dialogue history instead of the current single turn. The extracted global user edit requirements enable us to directly edit the input raw image to avoid error accumulation and attribute forgetting issues. Extensive quantitative and qualitative experiments on the I2Edit dataset demonstrate the advantage of our proposed framework over the previous single-turn methods. We believe our new dataset could serve as a valuable resource to push forward the exploration of real-world, complex interactive image editing. Code and data will be made public.

[1]  A. Geramifard,et al.  Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation , 2022, ArXiv.

[2]  Wenhu Chen,et al.  Controllable Dialogue Simulation with In-Context Learning , 2022, EMNLP.

[3]  Carel van Niekerk,et al.  Dynamic Dialogue Policy for Continual Reinforcement Learning , 2022, COLING.

[4]  Chen Change Loy,et al.  TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Ryan J. Lowe,et al.  Training language models to follow instructions with human feedback , 2022, NeurIPS.

[6]  Elman Mansimov,et al.  Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System , 2021, ACL.

[7]  Chen Change Loy,et al.  Talk-to-Edit: Fine-Grained Facial Editing via Dialog , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Zhirui Zhang,et al.  Task-Oriented Dialogue System as Natural Language Generation , 2021, SIGIR.

[9]  Reut Tsarfaty,et al.  Draw Me a Flower: Processing and Grounding Abstraction in Natural Language , 2021, Transactions of the Association for Computational Linguistics.

[10]  Daniel Cohen-Or,et al.  StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[12]  Daniel Cohen-Or,et al.  Designing an encoder for StyleGAN image manipulation , 2021, ACM Trans. Graph..

[13]  Jennifer Foster,et al.  How to Make Neural Natural Language Generation as Reliable as Templates in Task-Oriented Dialogue , 2020, EMNLP.

[14]  Daniel Cohen-Or,et al.  Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Pieter Abbeel,et al.  Denoising Diffusion Probabilistic Models , 2020, NeurIPS.

[16]  R. Socher,et al.  A Simple Language Model for Task-Oriented Dialogue , 2020, Neural Information Processing Systems.

[17]  Hao Li,et al.  Intuitive, Interactive Beard and Hair Synthesis With Generative Models , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Tero Karras,et al.  Analyzing and Improving the Image Quality of StyleGAN , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[20]  Richard Socher,et al.  Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems , 2019, ACL.

[21]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yoshua Bengio,et al.  Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Xinlei Chen,et al.  CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication , 2017, ACL.

[26]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[27]  Xiang Zhou,et al.  Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning , 2017, EMNLP.

[28]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[29]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[30]  Jianfeng Gao,et al.  A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.

[31]  Surya Ganguli,et al.  Deep Unsupervised Learning using Nonequilibrium Thermodynamics , 2015, ICML.

[32]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[33]  Jacob Cohen,et al.  The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability , 1973 .

[34]  Zhenan Sun,et al.  Hierarchical Face Aging Through Disentangled Latent Characteristics , 2020, ECCV.