A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, performance on real-world websites still suffers from (1) the open-domain nature of real sites, (2) limited context length, and (3) a lack of inductive bias for HTML. We introduce WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM for grounded code generation, and with HTML-T5, a new pre-trained LLM for long HTML documents that combines local and global attention mechanisms with a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that this recipe improves the success rate on a real website by over 50%, and that HTML-T5 is the best model for solving HTML-based tasks: it achieves a 14.9% higher success rate than the prior state of the art on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.
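
To make the division of labor concrete, the sketch below shows how the three stages described in the abstract (planning, HTML summarization, and program synthesis) could compose into a single episode loop. This is a minimal illustration, not the authors' released code: the `TextModel` and `Browser` interfaces, the prompt formats, and all function names are assumptions introduced here for exposition; in the paper, HTML-T5 fills the planner/summarizer role and Flan-U-PaLM generates the grounded Python programs.

```python
from typing import Protocol


class TextModel(Protocol):
    """Assumed minimal interface for a text-generation model (hypothetical)."""
    def generate(self, prompt: str) -> str: ...


class Browser(Protocol):
    """Assumed minimal interface for a controllable browser (hypothetical)."""
    def page_source(self) -> str: ...
    def execute(self, program: str) -> None: ...


def plan_step(planner: TextModel, instruction: str, history: list[str]) -> str:
    """Decompose the instruction into the next canonical sub-instruction,
    conditioned on the sub-instructions already executed."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Completed sub-instructions: {history}\n"
        "Next sub-instruction:"
    )
    return planner.generate(prompt)


def summarize_html(planner: TextModel, raw_html: str, sub_instruction: str) -> str:
    """Condense a long HTML document into snippets relevant to the current
    sub-instruction, so the code generator sees a short context."""
    prompt = (
        f"Sub-instruction: {sub_instruction}\n"
        f"HTML:\n{raw_html}\n"
        "Task-relevant snippets:"
    )
    return planner.generate(prompt)


def synthesize_program(coder: TextModel, sub_instruction: str, snippets: str) -> str:
    """Generate an executable Python program grounded in the sub-instruction
    and the summarized HTML snippets."""
    prompt = (
        f"# Task: {sub_instruction}\n"
        f"# Relevant page elements:\n{snippets}\n"
        "# Python program that performs the task:"
    )
    return coder.generate(prompt)


def run_episode(planner: TextModel, coder: TextModel, browser: Browser,
                instruction: str, max_steps: int = 10) -> None:
    """Plan -> summarize -> synthesize -> act, repeated up to the step budget."""
    history: list[str] = []
    for _ in range(max_steps):
        sub = plan_step(planner, instruction, history)
        snippets = summarize_html(planner, browser.page_source(), sub)
        program = synthesize_program(coder, sub, snippets)
        browser.execute(program)  # run the generated program on the live page
        history.append(sub)
```

Summarizing before code generation is the key design choice this sketch captures: the code-generation model never sees the full raw HTML, only the snippets the planner deems task-relevant, which keeps its context within the limits the abstract identifies as a bottleneck.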
