T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed for evaluating compositional text-to-image generation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and the GORS approach. The project page is available at https://karine-h.github.io/T2I-CompBench/.
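To make the reward-driven sample selection idea behind GORS concrete, the sketch below illustrates the general pattern the name suggests: generate candidate images for each compositional prompt, score each candidate with a text-image alignment reward, and keep only high-reward samples for subsequent fine-tuning of the pretrained model. This is a minimal sketch under stated assumptions, not the paper's implementation; the helpers `generate_images` and `reward_fn`, the threshold, and the sampling budget are hypothetical placeholders.

```python
# Hypothetical sketch of reward-driven sample selection for fine-tuning a
# pretrained text-to-image model (the general idea named by GORS).
# `generate_images` and `reward_fn` are placeholder callables, not the
# authors' code; any alignment scorer could stand in for the reward.
from typing import Callable, List, Tuple


def select_high_reward_samples(
    prompts: List[str],
    generate_images: Callable[[str, int], list],   # (prompt, n) -> list of images
    reward_fn: Callable[[str, object], float],     # (prompt, image) -> alignment score
    samples_per_prompt: int = 4,
    reward_threshold: float = 0.5,
) -> List[Tuple[str, object, float]]:
    """Generate candidate images per prompt and keep only those whose
    text-image alignment reward exceeds the threshold."""
    selected = []
    for prompt in prompts:
        for image in generate_images(prompt, samples_per_prompt):
            reward = reward_fn(prompt, image)
            if reward >= reward_threshold:
                selected.append((prompt, image, reward))
    return selected
```

The selected (prompt, image, reward) triples would then be used to fine-tune the pretrained text-to-image model, for example with a training loss weighted by the reward so that well-composed samples contribute more; the exact fine-tuning objective is an assumption here, not something stated in this abstract.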
