Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications that use social media text, and its unique language variety, we pretrain two models, one on tweets and one on forum text, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
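As a rough illustration of the kind of similarity-based selection investigated here, the sketch below ranks candidate pretraining corpora by the Jensen-Shannon divergence between their unigram distributions and that of a small target-task sample. The choice of measure, the whitespace tokenization, and the toy corpora are illustrative assumptions only; the abstract does not specify which similarity measures or corpora the paper actually uses.

```python
# Illustrative sketch only: rank candidate pretraining corpora by how closely
# their unigram term distributions match a target (task) corpus, using
# Jensen-Shannon divergence. The paper's actual measures may differ.
from collections import Counter
import math


def unigram_dist(texts, vocab):
    """Lower-cased whitespace unigram distribution over a shared vocabulary."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    return {w: counts[w] / total for w in vocab}


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions over the same vocab."""
    def kl(a, b):
        return sum(a[w] * math.log((a[w] + eps) / (b[w] + eps)) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_candidates(target_texts, candidates):
    """Return candidate corpora sorted from most to least similar to the target."""
    all_texts = list(target_texts) + [t for texts in candidates.values() for t in texts]
    vocab = set(tok for doc in all_texts for tok in doc.lower().split())
    p = unigram_dist(target_texts, vocab)
    scores = [(name, js_divergence(p, unigram_dist(texts, vocab)))
              for name, texts in candidates.items()]
    return sorted(scores, key=lambda x: x[1])


if __name__ == "__main__":
    # Hypothetical toy data: a target sample of health-related tweets and two
    # candidate pretraining sources.
    target = ["omg this flu shot made my arm sooo sore", "anyone else get side effects?"]
    candidates = {
        "tweets": ["lol got my shot today #vaccinated", "feeling rough after the jab tbh"],
        "wikipedia": ["Vaccination is the administration of a vaccine to stimulate immunity."],
    }
    for name, score in rank_candidates(target, candidates):
        print(f"{name}: JS divergence = {score:.4f}")  # lower = more similar to the target
```

In practice, alternatives such as language-model perplexity or embedding-based distances could play the same role; the point of the sketch is only that a cheap corpus-level similarity score can be used to nominate pretraining data before committing to costly pretraining.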
