Pre-training Text Representations as Meta Learning

Pre-training text representations has recently been shown to significantly improve the state of the art in many natural language processing tasks. The central goal of pre-training is to learn text representations that are useful for subsequent tasks. However, existing approaches are optimized by minimizing a proxy objective, such as the negative log-likelihood of language modeling. In this work, we introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks. We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps. The standard multi-task learning objective adopted in BERT is a special case of our learning algorithm in which the depth of the meta-train phase is zero. We study the problem in two settings, unsupervised pre-training and supervised pre-training, with different pre-training objectives to verify the generality of our approach. Experimental results show that our algorithm brings improvements and learns better initializations for a variety of downstream tasks.
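As a rough illustration of the connection described above, the sketch below implements a first-order, MAML-style meta-update over a batch of pre-training tasks: each task takes a few inner "meta-train" gradient steps before the outer loss is evaluated, and setting the meta-train depth to zero recovers the ordinary multi-task objective. This is only a minimal toy sketch of the general idea, not the authors' implementation; the encoder, task sampler, and all names and hyperparameters (SmallEncoder, make_task_batch, meta_train_depth, inner_lr, outer_lr) are hypothetical placeholders.

    # Minimal, first-order sketch with toy data and hypothetical names; not the paper's code.
    import copy
    import torch
    import torch.nn as nn

    class SmallEncoder(nn.Module):
        """Stand-in for a pre-trained text encoder followed by a task head."""
        def __init__(self, dim=16, num_classes=2):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
            )

        def forward(self, x):
            return self.body(x)

    def make_task_batch(dim=16, n=8, num_classes=2):
        """Toy stand-in for sampling (support, query) batches of one pre-training task."""
        support = (torch.randn(n, dim), torch.randint(0, num_classes, (n,)))
        query = (torch.randn(n, dim), torch.randint(0, num_classes, (n,)))
        return support, query

    def meta_pretrain_step(model, tasks, meta_train_depth=1, inner_lr=0.1, outer_lr=0.01):
        """One first-order meta-update over a batch of pre-training tasks.

        With meta_train_depth == 0 the inner loop disappears and the update is
        exactly the standard multi-task (BERT-style) gradient step.
        """
        loss_fn = nn.CrossEntropyLoss()
        meta_grads = [torch.zeros_like(p) for p in model.parameters()]

        for (xs, ys), (xq, yq) in tasks:
            fast = copy.deepcopy(model)        # task-specific copy of the shared weights
            fast_params = list(fast.parameters())

            # Meta-train: a few SGD steps on the task's own loss.
            for _ in range(meta_train_depth):
                inner_loss = loss_fn(fast(xs), ys)
                grads = torch.autograd.grad(inner_loss, fast_params)
                with torch.no_grad():
                    for p, g in zip(fast_params, grads):
                        p -= inner_lr * g

            # Meta-test: gradient of the loss at the adapted parameters (first-order).
            outer_loss = loss_fn(fast(xq), yq)
            grads = torch.autograd.grad(outer_loss, fast_params)
            for acc, g in zip(meta_grads, grads):
                acc += g / len(tasks)

        # Apply the accumulated meta-gradient to the shared parameters.
        with torch.no_grad():
            for p, g in zip(model.parameters(), meta_grads):
                p -= outer_lr * g

    if __name__ == "__main__":
        torch.manual_seed(0)
        model = SmallEncoder()
        tasks = [make_task_batch() for _ in range(4)]
        meta_pretrain_step(model, tasks, meta_train_depth=1)  # meta-learning variant
        meta_pretrain_step(model, tasks, meta_train_depth=0)  # plain multi-task special case

In a real setup each task would be a pre-training objective (e.g. masked language modeling or next-sentence prediction) over text batches rather than random tensors, and the inner loop could be made higher-order by keeping the adaptation steps in the autograd graph.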
