A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, performance on real-world websites still suffers from (1) the open-domain nature of real sites, (2) limited context length, and (3) a lack of inductive bias for HTML. We introduce WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM for grounded code generation, and with HTML-T5, a new pre-trained LLM for long HTML documents that combines local and global attention mechanisms with a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that this recipe improves the success rate on a real website by over 50%, and that HTML-T5 is the best model for solving HTML-based tasks: it achieves a 14.9% higher success rate than the prior state of the art on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.
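
To make the division of labor concrete, the sketch below shows how the three stages described in the abstract (planning, HTML summarization, and program synthesis) could compose into a single episode loop. This is a minimal illustration, not the authors' released code: the `TextModel` and `Browser` interfaces, the prompt formats, and all function names are assumptions introduced here for exposition; in the paper, HTML-T5 fills the planner/summarizer role and Flan-U-PaLM generates the grounded Python programs.

```python
from typing import Protocol


class TextModel(Protocol):
    """Assumed minimal interface for a text-generation model (hypothetical)."""
    def generate(self, prompt: str) -> str: ...


class Browser(Protocol):
    """Assumed minimal interface for a controllable browser (hypothetical)."""
    def page_source(self) -> str: ...
    def execute(self, program: str) -> None: ...


def plan_step(planner: TextModel, instruction: str, history: list[str]) -> str:
    """Decompose the instruction into the next canonical sub-instruction,
    conditioned on the sub-instructions already executed."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Completed sub-instructions: {history}\n"
        "Next sub-instruction:"
    )
    return planner.generate(prompt)


def summarize_html(planner: TextModel, raw_html: str, sub_instruction: str) -> str:
    """Condense a long HTML document into snippets relevant to the current
    sub-instruction, so the code generator sees a short context."""
    prompt = (
        f"Sub-instruction: {sub_instruction}\n"
        f"HTML:\n{raw_html}\n"
        "Task-relevant snippets:"
    )
    return planner.generate(prompt)


def synthesize_program(coder: TextModel, sub_instruction: str, snippets: str) -> str:
    """Generate an executable Python program grounded in the sub-instruction
    and the summarized HTML snippets."""
    prompt = (
        f"# Task: {sub_instruction}\n"
        f"# Relevant page elements:\n{snippets}\n"
        "# Python program that performs the task:"
    )
    return coder.generate(prompt)


def run_episode(planner: TextModel, coder: TextModel, browser: Browser,
                instruction: str, max_steps: int = 10) -> None:
    """Plan -> summarize -> synthesize -> act, repeated up to the step budget."""
    history: list[str] = []
    for _ in range(max_steps):
        sub = plan_step(planner, instruction, history)
        snippets = summarize_html(planner, browser.page_source(), sub)
        program = synthesize_program(coder, sub, snippets)
        browser.execute(program)  # run the generated program on the live page
        history.append(sub)
```

Summarizing before code generation is the key design choice this sketch captures: the code-generation model never sees the full raw HTML, only the snippets the planner deems task-relevant, which keeps its context within the limits the abstract identifies as a bottleneck.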
