Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompts while remaining consistent with the input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion. We also find that, as a cohort, these models are better at object rendering than text rendering, and handle material, color, and size attributes better than count and shape attributes.
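The object-masking idea can be illustrated with a minimal sketch. The abstract states only that object detectors propose inpainting masks during training; everything else here is an assumption made for illustration, including the `(top, left, bottom, right)` box format, the helper names `box_mask`, `propose_training_mask`, and `apply_mask`, and the random-rectangle fallback used when no object is detected. It is not the authors' implementation.

```python
import random


def box_mask(height, width, box):
    """Return a binary mask (list of rows): 1 inside the box, 0 elsewhere.

    `box` is (top, left, bottom, right) in pixels, with bottom/right
    exclusive. This box format is an assumption of the sketch.
    """
    top, left, bottom, right = box
    # Clip the box to the image bounds.
    top, left = max(0, top), max(0, left)
    bottom, right = min(height, bottom), min(width, right)
    return [
        [1 if top <= r < bottom and left <= c < right else 0
         for c in range(width)]
        for r in range(height)
    ]


def propose_training_mask(height, width, detected_boxes, rng):
    """Pick one detected object box and use it as the inpainting mask.

    `detected_boxes` stands in for the output of an object detector.
    The random-rectangle fallback is hypothetical, not from the source.
    """
    if detected_boxes:
        box = rng.choice(detected_boxes)
    else:
        top = rng.randrange(height)
        left = rng.randrange(width)
        bottom = rng.randrange(top + 1, height + 1)
        right = rng.randrange(left + 1, width + 1)
        box = (top, left, bottom, right)
    return box_mask(height, width, box)


def apply_mask(image, mask, fill=0):
    """Blank out masked pixels so a model would have to inpaint them."""
    return [
        [fill if m else px for px, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 4x5 single-channel "image" with one detected object box.
image = [[9] * 5 for _ in range(4)]
mask = propose_training_mask(4, 5, [(1, 1, 3, 4)], random.Random(0))
masked_image = apply_mask(image, mask)
```

Because the mask covers a whole detected object rather than an arbitrary patch, the hidden region can rarely be filled in from the surrounding pixels alone, which is one plausible reading of why object masking would improve text-image alignment.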
