The Trade-offs of Domain Adaptation for Neural Language Models

This work connects language model adaptation with concepts from machine learning theory. We consider a training setup with a large out-of-domain set and a small in-domain set. We derive how the benefit of training a model on either set depends on the size of the sets and the distance between their underlying distributions. We also show how adaptation techniques based on data selection, such as importance sampling, intelligent data selection, and influence functions, can be presented in a common framework that highlights both their similarities and their subtle differences.

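The first claim has the shape of a standard learning-theory trade-off (the paper's precise derivation is not reproduced here): when training on the large out-of-domain set and evaluating in-domain, the in-domain risk is roughly bounded by the training loss, plus an estimation term on the order of sqrt(capacity / n_out), plus a distance term between the in-domain and out-of-domain distributions. Training on the small in-domain set instead removes the distance term but inflates the estimation term, since n_in is much smaller than n_out.

The data-selection methods named in the abstract can likewise be read as different estimates of the same quantity, the likelihood ratio between the in-domain and out-of-domain distributions. The sketch below is a minimal illustration of that view, not the paper's code: it contrasts the Moore-Lewis cross-entropy difference with its importance-sampling reading using toy unigram language models. The helper names and example corpora are assumptions made for this example.

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Fit an add-one-smoothed unigram LM and return p(token)."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def cross_entropy(sentence, lm):
    """Per-token cross-entropy of a sentence under a unigram LM."""
    toks = sentence.split()
    return -sum(math.log(lm(t)) for t in toks) / max(len(toks), 1)

def moore_lewis_score(sentence, lm_in, lm_out):
    """Cross-entropy difference used by intelligent data selection:
    lower values mean the sentence looks more in-domain than generic."""
    return cross_entropy(sentence, lm_in) - cross_entropy(sentence, lm_out)

def importance_weight(sentence, lm_in, lm_out):
    """Importance-sampling reading of the same score: a per-token
    likelihood ratio p_in(x) / p_out(x) that reweights the
    out-of-domain loss toward the in-domain distribution."""
    return math.exp(-moore_lewis_score(sentence, lm_in, lm_out))

# Toy usage with assumed corpora (medical-flavored in-domain set).
in_domain = ["patients received the treatment",
             "the trial reported adverse events"]
out_domain = ["the market closed higher today",
              "patients enrolled in the clinical trial",
              "a new phone was released"]
lm_in, lm_out = unigram_lm(in_domain), unigram_lm(out_domain)

# Selection ranks and keeps the most in-domain-looking candidates;
# importance sampling keeps everything but scales each example's
# contribution to the training loss.
ranked = sorted(out_domain, key=lambda s: moore_lewis_score(s, lm_in, lm_out))
weights = {s: round(importance_weight(s, lm_in, lm_out), 3) for s in out_domain}
print(ranked[0])
print(weights)
```

Influence functions replace the likelihood ratio with a gradient-based estimate of how much each out-of-domain example would reduce the in-domain loss, but the select-or-reweight structure of the recipe stays the same.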