Cold-start Active Learning through Self-Supervised Language Modeling

Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can identify examples that surprise the model and should be labeled for efficient fine-tuning. Therefore, we treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy within fewer sampling iterations and less computation time.
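As a rough illustration of the core idea, the sketch below scores each unlabeled example by BERT's masked language modeling loss and selects the highest-loss ("most surprising") examples to label first. It assumes the HuggingFace `transformers` library; the model name, masking rate, and plain top-k selection are illustrative simplifications, not the paper's full sampling strategy.

```python
# Sketch: rank unlabeled texts by BERT's masked language modeling (MLM) loss
# and pick the most surprising ones as the initial labeling batch.
# Assumes HuggingFace `transformers` + PyTorch; hyperparameters are illustrative.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def mlm_loss(text: str, mask_prob: float = 0.15) -> float:
    """Average masked-LM loss of `text` under random masking of ~15% of tokens."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()

    # Choose non-special tokens to mask at random.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    mask = (torch.rand(input_ids.shape[1]) < mask_prob) & ~special
    if not mask.any():  # make sure at least one token contributes to the loss
        nonspecial = (~special).nonzero(as_tuple=True)[0]
        mask[nonspecial[0]] = True

    labels[0, ~mask] = -100              # unmasked positions are ignored by the loss
    input_ids[0, mask] = tokenizer.mask_token_id

    with torch.no_grad():
        out = model(
            input_ids=input_ids,
            attention_mask=enc["attention_mask"],
            labels=labels,
        )
    return out.loss.item()


def select_for_labeling(unlabeled_texts, budget: int):
    """Return the `budget` texts the pre-trained LM finds most surprising."""
    return sorted(unlabeled_texts, key=mlm_loss, reverse=True)[:budget]
```

In practice one would average the loss over several random maskings per example to reduce variance before ranking; the single-pass version above keeps the sketch short.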
