On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation

Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected-data training, and then fine-tuning. Data selection improves target-domain generalization by continuing training on pretraining data identified as relevant with the help of a small sample of target-domain data. This work examines the benefit of data selection for language modeling and machine translation. Our experiments assess how selection complements fine-tuning and lead to practical recommendations: (i) the selected data must be similar to the fine-tuning domain, but not so similar that it erodes the complementary effect of fine-tuning; (ii) there is a trade-off between selecting a small amount of data for fast but limited progress and a large amount of data for slow but longer-lasting progress; (iii) data selection can be applied early during pretraining, with performance gains comparable to those of a long pretraining session; (iv) data selection from domain classifiers is often more effective than the popular contrastive data selection method.
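
The abstract contrasts two families of selection scores: contrastive data selection, which ranks pretraining examples by the difference in likelihood under a target-domain model and a generic model, and domain-classifier selection, which ranks them by the probability a binary classifier assigns to the target domain. The sketch below illustrates both under simplifying assumptions and is not the paper's implementation: unigram language models and a scikit-learn bag-of-words classifier stand in for the neural models, and all corpus and function names are hypothetical placeholders.

```python
# Minimal sketch (assumptions noted above) of two data-selection scores:
#   (a) contrastive selection: score = log P_target(x) - log P_generic(x)
#   (b) domain classifier:     score = P(target-domain | x)
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab):
    # Add-one smoothed unigram log-probability, length-normalized.
    lp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return lp / max(len(tokens), 1)

def contrastive_scores(pool, target_sample, generic_sample):
    # Higher score = more likely under the target-domain LM than the generic LM.
    tgt = Counter(t for s in target_sample for t in s.split())
    gen = Counter(t for s in generic_sample for t in s.split())
    vocab = len(set(tgt) | set(gen))
    scores = []
    for sent in pool:
        toks = sent.split()
        scores.append(
            unigram_logprob(toks, tgt, sum(tgt.values()), vocab)
            - unigram_logprob(toks, gen, sum(gen.values()), vocab)
        )
    return scores

def classifier_scores(pool, target_sample, generic_sample):
    # Binary domain classifier over bag-of-words features (scikit-learn assumed).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    texts = target_sample + generic_sample
    labels = [1] * len(target_sample) + [0] * len(generic_sample)
    vec = CountVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    return clf.predict_proba(vec.transform(pool))[:, 1]

def select_top(pool, scores, k):
    # Keep the k pool sentences ranked most similar to the target domain.
    ranked = sorted(zip(scores, pool), key=lambda p: p[0], reverse=True)
    return [sent for _, sent in ranked[:k]]
```

Either score induces a ranking of the pretraining pool; the selected-data training phase then continues training on the top-ranked examples before fine-tuning on the target-domain sample, and the choice of how many examples to keep reflects the trade-off described in recommendation (ii).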
