I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and the juxtaposition of symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL·E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor, containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact with both the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. To evaluate the utility of our human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.
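The two-stage pipeline described in the abstract (an LLM produces a visual elaboration, which a text-to-image model then renders) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt template, the `VisualMetaphor` container, the function names, and the stub models are all hypothetical, and the example metaphor is for illustration only. In practice, `llm` and `text_to_image` would wrap calls to an actual language model and diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical Chain-of-Thought style template (an assumption for this sketch;
# the prompt actually used in the paper is not reproduced here).
COT_TEMPLATE = (
    "Metaphor: {metaphor}\n"
    "Let's think step by step about the implicit meaning and the objects "
    "that could depict it.\n"
    "Visual elaboration:"
)


@dataclass
class VisualMetaphor:
    """Hypothetical container for one pipeline result."""
    metaphor: str
    elaboration: str
    images: List[bytes]


def build_prompt(metaphor: str) -> str:
    """Fill the (assumed) template with a linguistic metaphor."""
    return COT_TEMPLATE.format(metaphor=metaphor.strip())


def generate_visual_metaphor(
    metaphor: str,
    llm: Callable[[str], str],
    text_to_image: Callable[[str], bytes],
    n_images: int = 4,
) -> VisualMetaphor:
    """Stage 1: the LLM writes a visual elaboration of the metaphor.
    Stage 2: the elaboration is passed to the text-to-image model."""
    elaboration = llm(build_prompt(metaphor)).strip()
    images = [text_to_image(elaboration) for _ in range(n_images)]
    return VisualMetaphor(metaphor, elaboration, images)


# Stub models so the sketch runs without any external service.
def fake_llm(prompt: str) -> str:
    return " An illustration of a bedroom with pigs rolling in mud on the floor. "


def fake_text_to_image(text: str) -> bytes:
    return text.encode("utf-8")  # stands in for image bytes


if __name__ == "__main__":
    result = generate_visual_metaphor(
        "My bedroom is a pigsty", fake_llm, fake_text_to_image, n_images=2
    )
    print(result.elaboration)
    print(len(result.images), "candidate images")
```

Passing the models in as plain callables keeps the sketch independent of any particular API. It also leaves room for a human to inspect or edit the elaboration between the two stages, in the spirit of the human-AI collaboration framework the abstract mentions.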
