T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed for evaluating compositional text-to-image generation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and the GORS approach. The project page is available at https://karine-h.github.io/T2I-CompBench/.
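To make the reward-driven sample selection idea behind GORS concrete, the sketch below illustrates the general pattern the name suggests: generate candidate images for each compositional prompt, score each candidate with a text-image alignment reward, and keep only high-reward samples for subsequent fine-tuning of the pretrained model. This is a minimal sketch under stated assumptions, not the paper's implementation; the helpers `generate_images` and `reward_fn`, the threshold, and the sampling budget are hypothetical placeholders.

```python
# Hypothetical sketch of reward-driven sample selection for fine-tuning a
# pretrained text-to-image model (the general idea named by GORS).
# `generate_images` and `reward_fn` are placeholder callables, not the
# authors' code; any alignment scorer could stand in for the reward.
from typing import Callable, List, Tuple


def select_high_reward_samples(
    prompts: List[str],
    generate_images: Callable[[str, int], list],   # (prompt, n) -> list of images
    reward_fn: Callable[[str, object], float],     # (prompt, image) -> alignment score
    samples_per_prompt: int = 4,
    reward_threshold: float = 0.5,
) -> List[Tuple[str, object, float]]:
    """Generate candidate images per prompt and keep only those whose
    text-image alignment reward exceeds the threshold."""
    selected = []
    for prompt in prompts:
        for image in generate_images(prompt, samples_per_prompt):
            reward = reward_fn(prompt, image)
            if reward >= reward_threshold:
                selected.append((prompt, image, reward))
    return selected
```

The selected (prompt, image, reward) triples would then be used to fine-tune the pretrained text-to-image model, for example with a training loss weighted by the reward so that well-composed samples contribute more; the exact fine-tuning objective is an assumption here, not something stated in this abstract.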
