Multimodal Procedural Planning via Dual Text-Image Prompting

Embodied agents have achieved strong performance in following human instructions to complete tasks. However, the potential of providing instructions informed by both text and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities. To tackle this, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages the zero-shot reasoning ability of large language models (LLMs) and the compelling text-to-image generation ability of diffusion-based models. TIP improves the interaction between the two modalities via a Text-to-Image Bridge and an Image-to-Text Bridge, allowing the LLM to guide text-grounded image plan generation and, in reverse, leveraging descriptions of the image plans to ground the textual plan. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. On WIKIPLAN and RECIPEPLAN, TIP outperforms unimodal and multimodal baselines in both human preference and automatic scores for informativeness, temporal coherence, and plan accuracy. Our code and data: https://github.com/YujieLu10/MPP.
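For concreteness, the loop below is a minimal sketch of how such a dual-bridge plan generator could be wired up. It assumes an off-the-shelf Stable Diffusion pipeline as the text-to-image model and BLIP as the image captioner; the `llm` callable, the prompt templates, and the function names (`multimodal_plan`, `caption`) are hypothetical stand-ins, not the authors' implementation.

```python
# Sketch of a TIP-style dual-bridge loop (illustrative, not the authors' code).
# Assumes: an LLM client `llm(prompt) -> str`, Stable Diffusion as the
# text-to-image model, BLIP as the captioner; all prompts are hypothetical.
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
cap_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption(image) -> str:
    """Image-to-Text Bridge: describe a generated image step in text."""
    inputs = cap_proc(image, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return cap_proc.decode(out[0], skip_special_tokens=True)

def multimodal_plan(goal: str, llm, num_steps: int = 5):
    """Generate paired (text step, image) plan steps for a high-level goal."""
    steps, images = [], []
    context = f"Goal: {goal}\nWrite step 1 of a plan to achieve this goal."
    for i in range(1, num_steps + 1):
        text_step = llm(context)  # zero-shot LLM planning
        # Text-to-Image Bridge: let the LLM rewrite the step as an image prompt.
        img_prompt = llm(f"Rewrite as a vivid image-generation prompt: {text_step}")
        image = t2i(img_prompt).images[0]
        desc = caption(image)  # Image-to-Text Bridge
        steps.append(text_step)
        images.append(image)
        # Ground the next textual step on what the image actually shows.
        context = (f"Goal: {goal}\nSteps so far: {'; '.join(steps)}\n"
                   f"The image for the last step shows: {desc}\n"
                   f"Write step {i + 1}.")
    return list(zip(steps, images))
```

The key design choice here is the feedback in the final prompt: each new textual step is conditioned on a caption of the image actually generated for the previous step, which is what keeps the two modalities temporally coherent rather than letting them drift apart.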
