BAM! Born-Again Multi-Task Networks for Natural Language Understanding

It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.
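The abstract does not give the exact training objective, but the core idea of teacher annealing can be illustrated with a short sketch. Below is a minimal PyTorch-style example, assuming a classification task and a linear schedule for the mixing weight; the function name annealed_distillation_loss and the specific schedule are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def annealed_distillation_loss(student_logits, teacher_logits, gold_labels,
                               step, total_steps):
    """Interpolate between distilling from a single-task teacher and
    ordinary supervised learning. The supervised weight lam grows over
    training (a linear schedule is assumed here for illustration)."""
    lam = step / total_steps  # 0.0 -> pure distillation, 1.0 -> pure supervised
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    gold_probs = F.one_hot(gold_labels,
                           num_classes=student_logits.size(-1)).float()
    # Mix the teacher's soft predictions with the gold one-hot targets,
    # then train the multi-task student toward the mixed distribution.
    target = lam * gold_probs + (1.0 - lam) * teacher_probs
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Usage sketch: early in training (step 1000 of 10000) the loss is mostly
# distillation from the single-task teacher; late in training it is mostly
# standard supervised cross-entropy on the gold labels.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
gold_labels = torch.randint(0, 3, (8,))
loss = annealed_distillation_loss(student_logits, teacher_logits, gold_labels,
                                  step=1000, total_steps=10000)
```

Annealing toward the gold labels is what lets the multi-task student eventually surpass its single-task teachers rather than merely imitate them.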
