Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
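Both adaptation phases use the same training objective as the original pretraining run; only the corpus changes. A minimal sketch of one such phase follows, assuming a RoBERTa-style masked LM and the HuggingFace Transformers Trainer API; the corpus path and hyperparameters are illustrative placeholders rather than the paper's exact settings.

```python
# One adaptive-pretraining phase: continue masked-LM training on unlabeled text.
# "domain_corpus.txt" is a hypothetical one-document-per-line file; swap in
# in-domain text for DAPT or the task's own unlabeled inputs for TAPT.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

class UnlabeledCorpus(torch.utils.data.Dataset):
    """Tokenizes a plain-text file, one document per line."""
    def __init__(self, path, tokenizer, max_len=512):
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        self.enc = tokenizer(lines, truncation=True, max_length=max_len)

    def __len__(self):
        return len(self.enc["input_ids"])

    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Dynamic masking at the standard 15% rate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-roberta",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=UnlabeledCorpus("domain_corpus.txt", tokenizer),
    data_collator=collator,
)
trainer.train()       # second phase of pretraining
trainer.save_model()  # fine-tune this checkpoint on the labeled task
```

The saved checkpoint is then fine-tuned on the labeled task exactly as the off-the-shelf model would be, so the two phases compose: run the loop once on domain text, again on task text, then fine-tune.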

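For the data-selection variant, a small sketch of the nearest-neighbor idea is below: embed the task's unlabeled examples and a large candidate pool in a shared space, then keep each task example's k closest pool documents as extra task-adaptive pretraining data. TF-IDF vectors stand in here for whatever lightweight embedding one prefers; the function and its parameters are hypothetical, not the paper's implementation.

```python
# Augment a small task corpus with its nearest neighbors from a candidate pool.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_neighbors(task_texts, pool_texts, k=50):
    """Return pool documents appearing in any task document's top-k neighbors."""
    vec = TfidfVectorizer(max_features=50_000).fit(task_texts + pool_texts)
    sims = cosine_similarity(vec.transform(task_texts),
                             vec.transform(pool_texts))  # (n_task, n_pool)
    nearest = np.argsort(-sims, axis=1)[:, :k]           # top-k indices per row
    keep = sorted(set(nearest.ravel().tolist()))
    return [pool_texts[i] for i in keep]

# augmented = task_texts + select_neighbors(task_texts, domain_pool, k=50)
# ...then run the masked-LM phase above on `augmented`.
```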