HTLM: Hyper-Text Pre-Training and Prompting of Language Models

We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-taskadjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling <title> tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at autoprompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.

[1]  Haoran Li,et al.  Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation , 2020, ArXiv.

[2]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[3]  Douwe Kiela,et al.  True Few-Shot Learning with Language Models , 2021, NeurIPS.

[4]  Judith Tonhauser,et al.  The CommitmentBank: Investigating projection in naturally occurring discourse , 2019 .

[5]  Samuel R. Bowman,et al.  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.

[6]  Veselin Stoyanov,et al.  Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.

[7]  Ali Farhadi,et al.  HellaSwag: Can a Machine Really Finish Your Sentence? , 2019, ACL.

[8]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments , 2007, WMT@ACL.

[9]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[10]  Ming-Wei Chang,et al.  BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , 2019, NAACL.

[11]  Benjamin Van Durme,et al.  Annotated Gigaword , 2012, AKBC-WEKEX@NAACL-HLT.

[12]  Hector J. Levesque,et al.  The Winograd Schema Challenge , 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[13]  Alexander M. Rush,et al.  How many data points is a prompt worth? , 2021, NAACL.

[14]  Claire Gardent,et al.  The WebNLG Challenge: Generating Text from RDF Data , 2017, INLG.

[15]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[16]  Arman Cohan,et al.  Longformer: The Long-Document Transformer , 2020, ArXiv.

[17]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[18]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[19]  Yao Zhao,et al.  PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization , 2020, ICML.

[20]  José Camacho-Collados,et al.  WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations , 2018, NAACL.

[21]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Hinrich Schutze,et al.  It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners , 2020, NAACL.

[23]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[24]  Anja Belz,et al.  Comparing Automatic and Human Evaluation of NLG Systems , 2006, EACL.

[25]  Percy Liang,et al.  Prefix-Tuning: Optimizing Continuous Prompts for Generation , 2021, ACL.

[26]  Li Yang,et al.  ETC: Encoding Long and Structured Inputs in Transformers , 2020, EMNLP.

[27]  Armen Aghajanyan,et al.  Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning , 2020, ACL.

[28]  Verena Rieser,et al.  The E2E Dataset: New Challenges For End-to-End Generation , 2017, SIGDIAL Conference.

[29]  Mirella Lapata,et al.  Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization , 2018, EMNLP.

[30]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[31]  Ido Dagan,et al.  The Sixth PASCAL Recognizing Textual Entailment Challenge , 2009, TAC.

[32]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[33]  Tao Yu,et al.  DART: Open-Domain Structured Data Record to Text Generation , 2020, NAACL.

[34]  Gunhee Kim,et al.  Abstractive Summarization of Reddit Posts with Multi-level Memory Networks , 2018, NAACL.

[35]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.