Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

Recent advances in large language models have elicited chain-of-thought reasoning, which allows models to decompose problems in a human-like fashion. Although this paradigm improves multi-step reasoning in language models, it remains unimodal and has been applied mainly to question-answering tasks. We argue that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. Consequently, we introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information, narrowing the logical gaps for downstream tasks that benefit from temporal reasoning, while also providing interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT produces novel and consistent synthetic data augmentations that outperform chain-of-thought baselines and can be used to enhance downstream performance.
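The recursive gap-bridging structure described above can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: the actual multimodal infilling step (image synthesis plus vision-language captioning) is stubbed out by a placeholder `generate_infill` function, and the recursion depth parameter is an assumption made here to bound the process.

```python
def generate_infill(left: str, right: str) -> str:
    """Stand-in for the multimodal infilling step.

    In the described method this would involve generating a bridging
    image and caption between two sequential elements; here we only
    return a marker so the recursion structure is runnable.
    """
    return f"<bridge: {left} -> {right}>"


def bridge(left: str, right: str, depth: int) -> list[str]:
    """Recursively insert synthetic infillings between two elements."""
    if depth == 0:
        return []
    mid = generate_infill(left, right)
    # Recurse into both halves of the gap, halving it each time.
    return bridge(left, mid, depth - 1) + [mid] + bridge(mid, right, depth - 1)


def vcot_augment(sequence: list[str], depth: int = 1) -> list[str]:
    """Interleave synthetic infillings into a sequence of steps."""
    out = [sequence[0]]
    for left, right in zip(sequence, sequence[1:]):
        out.extend(bridge(left, right, depth))
        out.append(right)
    return out
```

At `depth=1` each adjacent pair gains one infilling; higher depths subdivide each new gap again, which mirrors the recursive bridging the abstract describes.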
