Understanding HTML with Large Language Models

Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.
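
The abstract frames tasks such as semantic classification as text-to-text problems handled by a fine-tuned encoder-decoder LLM. As a minimal sketch of that framing (not the authors' released code), the snippet below feeds a raw HTML element to an off-the-shelf T5 model through the HuggingFace Transformers API; the prompt wording, the example element, and the expected category are illustrative assumptions, and the model would need to be fine-tuned on labeled (element, category) pairs before its output is meaningful.

```python
# Sketch: HTML element semantic classification as text-to-text generation.
# Assumes the HuggingFace Transformers library; model choice, prompt format,
# and the example snippet are illustrative, not the paper's exact setup.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# A raw HTML snippet centered on the element to classify.
html_snippet = (
    '<label for="uName">Username</label>'
    '<input id="uName" type="text">'
)
prompt = f"classify element: {html_snippet}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=8)

# After fine-tuning, the decoded string would be a category such as "username";
# without fine-tuning the output is arbitrary.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Description generation and autonomous web navigation fit the same text-to-text mold by changing the target string, e.g. a natural-language description of the element, or the next action to take on the page.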
