论文信息 - Sequential Attention GAN for Interactive Image Editing

Sequential Attention GAN for Interactive Image Editing

Most existing text-to-image synthesis tasks are static single-turn generation, based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task - Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as the input, and modifies the image generated in previous turn to a new design, following the user description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark on the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics including visual quality, image sequence coherence and text-image consistency.

[1] Fisher Yu,et al. TextureGAN: Controlling Deep Image Synthesis with Texture Patches , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[3] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[4] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Xiaodong Liu,et al. Language-Based Image Editing with Recurrent Attentive Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[7] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[8] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[9] Trevor Darrell,et al. Segmentation from Natural Language Expressions , 2016, ECCV.

[10] Simon Osindero,et al. Conditional Generative Adversarial Nets , 2014, ArXiv.

[11] Alexei A. Efros,et al. Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Yoshua Bengio,et al. Generative Adversarial Nets , 2014, NIPS.

[13] Jason D. Williams,et al. Learning to Globally Edit Images with Textual Description , 2018, ArXiv.

[14] Joelle Pineau,et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[15] Xiaogang Wang,et al. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Alexei A. Efros,et al. Toward Multimodal Image-to-Image Translation , 2017, NIPS.

[17] Yu Cheng,et al. BachGAN: High-Resolution Image Synthesis From Salient Object Layout , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Kallirroi Georgila,et al. Conversational Image Editing: Incremental Intent Identification in a New Dialogue Task , 2018, SIGDIAL Conference.

[19] Dimitris N. Metaxas,et al. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[20] Hugo Larochelle,et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Yu Cheng,et al. Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond , 2018, ArXiv.

[22] Abhinav Gupta,et al. Generative Image Modeling Using Style and Structure Adversarial Networks , 2016, ECCV.

[23] Stefan Lee,et al. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24] Kallirroi Georgila,et al. Using Reinforcement Learning to Model Incrementality in a Fast-Paced Dialogue Game , 2017, SIGDIAL Conference.

[25] Wu Chou,et al. Discriminative learning in sequential pattern recognition , 2008, IEEE Signal Processing Magazine.

[26] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[27] Alexei A. Efros,et al. Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Yu Cheng,et al. StoryGAN: A Sequential Conditional GAN for Story Visualization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Yuandong Tian,et al. CoDraw: Visual Dialog for Collaborative Drawing , 2017, ArXiv.

[30] Lior Wolf,et al. Specifying Object Attributes and Relations in Interactive Scene Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31] Claire Cardie,et al. The Neural Painter: Multi-Turn Image Generation , 2018, ArXiv.

[32] Zhe Gan,et al. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33] Yiming Yang,et al. MMD GAN: Towards Deeper Understanding of Moment Matching Network , 2017, NIPS.

[34] Sanja Fidler,et al. Be Your Own Prada: Fashion Synthesis with Structural Coherence , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35] Li Fei-Fei,et al. Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36] Yoshua Bengio,et al. ChatPainter: Improving Text to Image Generation using Dialogue , 2018, ICLR.

[37] Jan Kautz,et al. Unsupervised Image-to-Image Translation Networks , 2017, NIPS.

[38] Nuno Vasconcelos,et al. AGA: Attribute-Guided Augmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Lihong Li,et al. Neural Approaches to Conversational AI , 2019, Found. Trends Inf. Retr..

[40] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Jason Weston,et al. Learning End-to-End Goal-Oriented Dialog , 2016, ICLR.

[42] Jianfeng Gao,et al. Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation , 2017, IJCNLP.

[43] Yin Li,et al. Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Jonathon Shlens,et al. Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[45] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Seonghyeon Nam,et al. Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language , 2018, NeurIPS.

[47] Jing Wang,et al. Walk and Learn: Facial Attribute Representation Learning from Egocentric Video and Contextual Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Kristen Grauman,et al. Fine-Grained Visual Comparisons with Local Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49] Bo Zhao,et al. Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Bo Zhao,et al. Image Generation From Layout , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Fisher Yu,et al. Scribbler: Controlling Deep Image Synthesis with Sketch and Color , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] 拓海杉山,et al. “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[53] Rogério Schmidt Feris,et al. Dialog-based Interactive Image Retrieval , 2018, NeurIPS.

[54] Bernt Schiele,et al. Generative Adversarial Text to Image Synthesis , 2016, ICML.