Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with “task descriptions” in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that task descriptions and standard supervised learning can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. For several tasks and languages, PET outperforms supervised training and strong semi-supervised approaches in low-resource settings by a large margin.
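
To make the cloze reformulation concrete, the sketch below scores a single pattern-verbalizer pair with a masked language model from the HuggingFace transformers library. This is only a minimal sketch under assumed choices, not the paper's implementation: the roberta-base checkpoint, the binary sentiment labels, the pattern "All in all, it was ___." and the verbalizer words "great"/"terrible" are all illustrative assumptions.

```python
# A minimal sketch of one pattern-verbalizer pair for cloze-style
# classification; roberta-base, the pattern and the verbalizer are
# illustrative assumptions, not choices taken from the paper.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "roberta-base"  # any masked language model can be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Verbalizer: each label is represented by a single vocabulary token.
# Leading spaces matter for RoBERTa's byte-level BPE tokenization.
verbalizer = {"positive": " great", "negative": " terrible"}

def label_distribution(text: str) -> dict:
    """Rewrite `text` as a cloze question and return a soft label distribution."""
    # Pattern: embed the input in a phrase with one mask slot.
    prompt = f"{text} All in all, it was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]  # scores over the vocabulary
    # Score each label by the logit of its verbalizer token, then normalize.
    token_ids = {label: tokenizer(word, add_special_tokens=False).input_ids[0]
                 for label, word in verbalizer.items()}
    scores = torch.stack([logits[token_ids[label]] for label in verbalizer])
    probs = torch.softmax(scores, dim=0)
    return {label: probs[i].item() for i, label in enumerate(verbalizer)}

print(label_distribution("The movie was a complete waste of time."))
```

In PET itself, several such pattern-verbalizer pairs are first fine-tuned on the small labeled set; the normalized distributions they produce on unlabeled examples serve as soft labels, and a standard classifier is then trained on that soft-labeled set.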

[1] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[2] Huajun Chen, et al. Zero-shot Text Classification via Reinforced Self-training, 2020, ACL.

[3] Luis Gravano, et al. Snowball: extracting relations from large plain-text collections, 2000, DL '00.

[4] Juri Opitz, et al. Argumentative Relation Classification as Plausibility Ranking, 2019, KONVENS.

[5] Zhou Yu, et al. Domain Adaptive Dialog Generation via Meta Learning, 2019, ACL.

[6] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[7] Dan Roth, et al. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach, 2019, EMNLP.

[8] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[9] Ronan Le Bras, et al. Unsupervised Commonsense Question Answering with Self-Talk, 2020, EMNLP.

[10] Zi-Yi Dou, et al. Investigating Meta-Learning Algorithms for Low-Resource Natural Language Understanding Tasks, 2019, EMNLP.

[11] Veselin Stoyanov, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.

[12] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[13] Steven Schockaert, et al. Inducing Relational Knowledge from BERT, 2019, AAAI.

[14] Davidson He. Does This Make Sense?, 2015, The Consultant Pharmacist: The Journal of the American Society of Consultant Pharmacists.

[15] Sergey Brin, et al. Extracting Patterns and Relations from the World Wide Web, 1998, WebDB.

[16] Philip H. S. Torr, et al. An embarrassingly simple approach to zero-shot learning, 2015, ICML.

[18] Hinrich Schütze, et al. Negated LAMA: Birds cannot fly, 2019, ArXiv.

[19] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[20] Xiang Zhang, et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.

[21] Yejin Choi, et al. WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, 2020, AAAI.

[22] Sebastian Riedel, et al. Language Models as Knowledge Bases?, 2019, EMNLP.

[23] Allyson Ettinger, et al. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models, 2019, TACL.

[24] Alexandros Potamianos, et al. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models, 2019, NAACL.

[25] Bryan Catanzaro, et al. Zero-shot Text Classification With Generative Language Models, 2019, ArXiv.

[26] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[27] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[28] Xuanjing Huang, et al. How to Fine-Tune BERT for Text Classification?, 2019, CCL.

[29] Mário J. Silva, et al. Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics, 2015, EMNLP.

[30] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[31] Tom M. Mitchell, et al. Zero-shot Learning of Classifiers from Natural Language Quantification, 2018, ACL.

[32] Sebastian Ruder, et al. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.

[33] Alexander M. Rush, et al. Commonsense Knowledge Mining from Pretrained Models, 2019, EMNLP.

[34] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[35] Yu Cheng, et al. Diverse Few-Shot Text Classification with Multiple Metrics, 2018, NAACL.

[36] Eugene Charniak, et al. Effective Self-Training for Parsing, 2006, NAACL.

[37] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[38] Quoc V. Le, et al. A Simple Method for Commonsense Reasoning, 2018, ArXiv.

[39] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[40] Johannes Fürnkranz, et al. Using semantic similarity for multi-label zero-shot classification of text documents, 2016, ESANN.

[41] Kilian Q. Weinberger, et al. On Calibration of Modern Neural Networks, 2017, ICML.

[42] Jiajun Shen, et al. Revisiting Self-Training for Neural Sequence Generation, 2020, ICLR.

[43] Rico Sennrich, et al. Improving Neural Machine Translation Models with Monolingual Data, 2015, ACL.

[44] Quoc V. Le, et al. Unsupervised Data Augmentation for Consistency Training, 2019, NeurIPS.

[45] David Yarowsky, et al. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, 1995, ACL.

[46] Graham Neubig, et al. How Can We Know What Language Models Know?, 2019, TACL.

[47] Jonathan Berant, et al. oLMpics-On What Language Model Pre-training Captures, 2019, TACL.

[48] Richard Socher, et al. The Natural Language Decathlon: Multitask Learning as Question Answering, 2018, ArXiv.

[49] Diyi Yang, et al. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification, 2020, ACL.

[50] Ari Rappoport, et al. Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets, 2007, ACL.

[51] Hinrich Schütze, et al. Rare Words: A Major Problem for Contextualized Embeddings And How to Fix it by Attentive Mimicking, 2019, AAAI.

[52] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[53] Yue Zhang, et al. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation, 2019, ACL.

[54] Mary P. Harper, et al. Self-Training PCFG Grammars with Latent Annotations Across Languages, 2009, EMNLP.