Rethinking Semi-supervised Learning with Language Models

Semi-supervised learning (SSL) is a popular setting that aims to make effective use of unlabelled data to improve model performance on downstream natural language processing (NLP) tasks. Two approaches currently dominate: self-training (ST) and task-adaptive pre-training (TAPT). ST uses a teacher model to assign pseudo-labels to the unlabelled data, while TAPT continues pre-training on the unlabelled data before fine-tuning. To the best of our knowledge, the effectiveness of TAPT in SSL tasks has not been systematically studied, and no previous work has directly compared TAPT and ST in terms of their ability to exploit the pool of unlabelled data. In this paper, we provide an extensive empirical study comparing five state-of-the-art ST approaches and TAPT across various NLP tasks and data sizes, in both in-domain and out-of-domain settings. Surprisingly, we find that TAPT is a strong and more robust SSL learner than the more sophisticated ST approaches, even when using just a few hundred unlabelled samples or in the presence of domain shifts, and that it tends to bring greater improvements in SSL than in fully-supervised settings. Our further analysis demonstrates the risks of using ST approaches when the amount of labelled or unlabelled data is small or when domain shifts exist. We offer a fresh perspective for future SSL research, suggesting the use of unsupervised pre-training objectives over reliance on pseudo-labels.
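To make the two compared paradigms concrete, the sketch below contrasts a TAPT step (continued masked-language-model pre-training on the unlabelled pool) with a classic self-training step (a teacher assigning confidence-thresholded pseudo-labels). This is a minimal illustration, not the authors' code and not any of the five ST methods studied in the paper: it assumes the Hugging Face `transformers` library, a `roberta-base` checkpoint, and an illustrative confidence threshold of 0.9; the helper names `tapt` and `self_train_step` are hypothetical.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")


def tapt(unlabelled_texts, output_dir="tapt-roberta"):
    """TAPT: continue MLM pre-training on the unlabelled pool, then fine-tune from output_dir."""
    mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
    # A plain list of tokenized examples serves as an indexable dataset here.
    train_data = [tok(t, truncation=True, max_length=256) for t in unlabelled_texts]
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)
    trainer = Trainer(
        model=mlm,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_data,
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model(output_dir)  # downstream classifier is initialised from this checkpoint
    return output_dir


def self_train_step(teacher, unlabelled_texts, threshold=0.9):
    """ST: a teacher classifier (e.g. fine-tuned on the small labelled set) assigns pseudo-labels."""
    teacher.eval()
    pseudo_labelled = []
    with torch.no_grad():
        for text in unlabelled_texts:
            inputs = tok(text, truncation=True, max_length=256, return_tensors="pt")
            probs = torch.softmax(teacher(**inputs).logits, dim=-1)[0]
            conf, label = probs.max(dim=-1)
            if conf.item() >= threshold:  # keep only confident predictions as pseudo-labels
                pseudo_labelled.append((text, label.item()))
    # A student is then retrained on labelled data plus these pseudo-labelled examples.
    return pseudo_labelled
```

The contrast highlights the paper's argument: TAPT only ever optimises an unsupervised objective on the unlabelled text, whereas ST propagates the teacher's (possibly biased) predictions, which is where the risks under small labelled sets or domain shift arise.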
