Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Large language models (LLMs) have achieved remarkable progress on various natural language processing tasks, exhibiting emergent abilities. However, they face inherent limitations, such as an inability to access up-to-date information, utilize external tools, or perform precise mathematical reasoning. In this paper, we introduce Chameleon, a plug-and-play compositional reasoning framework that augments LLMs to help address these challenges. Chameleon synthesizes programs to compose various tools, including LLMs, off-the-shelf vision models, web search engines, Python functions, and rule-based modules tailored to user interests. Built on top of an LLM acting as a natural language planner, Chameleon infers the appropriate sequence of tools to compose and execute in order to generate a final response. We showcase the adaptability and effectiveness of Chameleon on two tasks: ScienceQA and TabMWP. Notably, Chameleon with GPT-4 achieves an 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%; on TabMWP, Chameleon with GPT-4 achieves a 17.8% increase over the state-of-the-art model, reaching a 98.78% overall accuracy. Further studies suggest that GPT-4 as a planner exhibits more consistent and rational tool selection and can infer potential constraints from the instructions, compared with other LLMs such as ChatGPT.
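
To make the planner-executor loop concrete, below is a minimal Python sketch of plug-and-play tool composition. It assumes a hypothetical `call_llm` helper and placeholder tool implementations; the module names loosely echo those mentioned in the paper (e.g., Bing search, solution generation), but this is an illustration of the general pattern, not the authors' implementation.

```python
# Minimal sketch of a Chameleon-style planner-executor loop (not the
# authors' code). `call_llm` is a hypothetical helper that sends a
# prompt to an LLM (e.g., GPT-4) and returns its text completion.

from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

# A registry of plug-and-play modules: each tool maps the current
# context (question plus intermediate results) to an updated context.
# The bodies here are placeholders standing in for real models/APIs.
TOOLS: Dict[str, Callable[[dict], dict]] = {
    "Image_Captioner": lambda ctx: {**ctx, "caption": "..."},
    "Bing_Search": lambda ctx: {**ctx, "search_result": "..."},
    "Solution_Generator": lambda ctx: {**ctx, "solution": "..."},
    "Answer_Generator": lambda ctx: {**ctx, "answer": "..."},
}

def plan(question: str) -> List[str]:
    """Ask the LLM planner for a tool sequence, in natural language."""
    prompt = (
        "Given the question below, list the tools to run in order, "
        f"comma-separated, chosen from {sorted(TOOLS)}.\n"
        f"Question: {question}\nTools:"
    )
    names = [t.strip() for t in call_llm(prompt).split(",")]
    return [t for t in names if t in TOOLS]  # drop hallucinated tools

def execute(question: str) -> dict:
    """Run the planned modules sequentially, threading the context."""
    ctx = {"question": question}
    for name in plan(question):
        ctx = TOOLS[name](ctx)
    return ctx
```

The key design choice this sketch captures is that the planner emits a program (here, an ordered list of module names) rather than a final answer, so new tools can be registered without retraining anything; only the tool inventory shown to the planner changes.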
