Self-training Improves Pre-training for Natural Language Understanding

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge distillation and few-shot learning.
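
To make the pipeline concrete, below is a minimal sketch of the retrieve-then-self-train idea described in the abstract. It is an illustration under stated assumptions, not the paper's actual implementation: the embed encoder and the teacher/student model interfaces are hypothetical placeholders, and only the FAISS similarity-search calls are real library APIs.

    # Sketch of SentAugment-style retrieval followed by self-training.
    # Assumptions: `embed`, `teacher`, and `student` are placeholders; the paper
    # uses its own sentence encoder and RoBERTa-based teacher/student models.

    import numpy as np
    import faiss  # similarity search over the large unlabeled sentence bank


    def embed(sentences):
        """Placeholder sentence encoder returning one vector per sentence
        (the paper uses a dedicated paraphrastic sentence embedding model)."""
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(sentences), 256)).astype("float32")


    def build_bank_index(bank_sentences):
        """Index the sentence bank for cosine (inner-product) search."""
        vectors = embed(bank_sentences)
        faiss.normalize_L2(vectors)
        index = faiss.IndexFlatIP(vectors.shape[1])
        index.add(vectors)
        return index


    def retrieve_task_data(labeled_sentences, bank_sentences, index, k=1000):
        """Build a task-specific query embedding from the labeled data and pull
        the k nearest unlabeled sentences from the bank."""
        query = embed(labeled_sentences).mean(axis=0, keepdims=True)
        faiss.normalize_L2(query)
        _, ids = index.search(query, k)
        return [bank_sentences[i] for i in ids[0]]


    def self_train(teacher, student, labeled_examples, retrieved_sentences):
        """Pseudo-label the retrieved sentences with the teacher, then train the
        student on labeled plus pseudo-labeled data (schematic only)."""
        pseudo = [(s, teacher.predict(s)) for s in retrieved_sentences]
        student.fit(labeled_examples + pseudo)
        return student

The query here is simply the mean of the labeled-set embeddings; finer-grained query strategies (for example, one query per label or per labeled sentence) fit the same retrieval step.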
