Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data

For natural language processing (NLP) tasks such as sentiment or topic classification, currently prevailing approaches heavily rely on pretraining large self-supervised models on massive external data resources. However, this methodology is being critiqued for: exceptional compute and pretraining data requirements; diminishing returns on both large and small datasets; and, importantly, favourable evaluation settings that overestimate performance differences. The core belief behind the current methodology, coined `the bitter lesson' by R. Sutton, is that `compute scale-up beats data and compute-efficient algorithms', neglecting that progress in compute hardware scale-up is based almost entirely on the miniaturisation of resource consumption. We thus approach pretraining from a miniaturisation perspective, so as not to require massive external data sources and models, or learned translations from continuous input embeddings to discrete labels. To minimise overly favourable evaluation, we examine learning on a long-tailed, low-resource, multi-label text classification dataset with noisy, highly sparse labels and many rare concepts. To this end, we propose a novel `dataset-internal' contrastive autoencoding approach to self-supervised pretraining and demonstrate marked improvements in zero-shot, few-shot and solely supervised learning performance, even under an unfavourable low-resource scenario, and without defaulting to large-scale external datasets for self-supervision. We also find empirical evidence that zero- and few-shot learning markedly benefit from adding more `dataset-internal', self-supervised training signals, which is of practical importance when retrieving or computing on large external sources of such signals is infeasible.
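A minimal, hypothetical sketch of the kind of dataset-internal contrastive text-to-label matching the abstract describes is given below: a shared word-embedding table feeds both a text encoder and a label embedder, and a small matcher scores (text, label) pairs, where self-supervised positives are words drawn from the input text itself and negatives are sampled from the vocabulary. The GRU/MLP choices, module names, and sampling scheme are illustrative assumptions for the sketch, not the paper's released implementation.

```python
# Sketch (assumed design, not the authors' code) of dataset-internal contrastive
# pretraining for multi-label text classification: a text encoder and a label
# embedder are trained jointly via a binary matcher over (text, label) pairs.
import torch
import torch.nn as nn

class ContrastiveTextLabelMatcher(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)        # shared embedding table
        self.text_encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.label_encoder = nn.Linear(emb_dim, hidden_dim)       # embeds a (pseudo-)label word
        self.scorer = nn.Sequential(                              # matcher over [text; label]
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, token_ids, label_ids):
        # token_ids: (batch, seq_len), label_ids: (batch, n_labels)
        _, h = self.text_encoder(self.word_emb(token_ids))        # h: (1, batch, hidden)
        text_vec = h.squeeze(0).unsqueeze(1)                      # (batch, 1, hidden)
        label_vec = self.label_encoder(self.word_emb(label_ids))  # (batch, n_labels, hidden)
        text_vec = text_vec.expand(-1, label_ids.size(1), -1)
        pairs = torch.cat([text_vec, label_vec], dim=-1)
        return self.scorer(pairs).squeeze(-1)                     # match logits per (text, label)

# Self-supervised step on a toy batch: positives are words sampled from the text,
# negatives are random vocabulary words; no external pretraining data is used.
model = ContrastiveTextLabelMatcher(vocab_size=10000)
loss_fn = nn.BCEWithLogitsLoss()
tokens = torch.randint(0, 10000, (4, 32))                         # toy batch of token ids
pos = tokens[:, :2]                                               # pseudo-labels: words from the text
neg = torch.randint(0, 10000, (4, 2))                             # sampled negative words
labels = torch.cat([pos, neg], dim=1)
targets = torch.cat([torch.ones(4, 2), torch.zeros(4, 2)], dim=1)
loss = loss_fn(model(tokens, labels), targets)
loss.backward()
```

Under these assumptions, supervised few-shot training reuses the same matcher with real label words, and zero-shot inference simply scores unseen label words through their embeddings, since labels are matched in embedding space rather than through fixed output units.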
