Facebook AI’s WMT20 News Translation Task Submission

This paper describes Facebook AI's submission to the WMT20 shared news translation task. We focus on the low-resource setting and participate in two language pairs, Tamil↔English and Inuktitut↔English, where there is limited out-of-domain bitext and monolingual data. We approach the low-resource problem with two main strategies: leveraging all available data and adapting the system to the target news domain. We explore techniques that leverage bitext and monolingual data from all languages, such as self-supervised model pretraining, multilingual models, data augmentation, and reranking. To better adapt the translation system to the test domain, we explore dataset tagging and fine-tuning on in-domain data. We observe that different techniques provide varied improvements depending on the data available for each language pair. Based on these findings, we integrate the techniques into one training pipeline. For En→Ta, we also explore an unconstrained setup with additional Tamil bitext and monolingual data and show that further gains can be obtained. On the test set, our best submitted systems achieve 21.5 and 13.7 BLEU for Ta→En and En→Ta respectively, and 27.9 and 13.0 for Iu→En and En→Iu respectively.
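
The dataset tagging mentioned above is commonly implemented by prepending a special token to each source sentence so the model can condition on the data's provenance (genuine bitext, back-translated data, in-domain news, and so on). Below is a minimal illustrative sketch of this idea in Python; the tag strings and file-handling details are assumptions for illustration, not the preprocessing used in the submitted systems.

```python
# Illustrative dataset tagging for NMT training data: prepend a
# provenance/domain tag to every source sentence so the model can
# condition on where the data came from. Tag strings are hypothetical.

TAGS = {
    "bitext": "<ORIG>",          # genuine parallel data
    "backtranslated": "<BT>",    # synthetic data from back-translation
    "news": "<DOMAIN_NEWS>",     # in-domain news data
}

def tag_corpus(src_path: str, out_path: str, tag_key: str) -> None:
    """Write a copy of a source-side corpus with the chosen tag prepended to every line."""
    tag = TAGS[tag_key]
    with open(src_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(f"{tag} {line.strip()}\n")

if __name__ == "__main__":
    # Example: tag a back-translated Tamil source file before mixing it
    # with the genuine bitext for training.
    tag_corpus("mono.bt.ta", "mono.bt.tagged.ta", "backtranslated")
```

At training time, the tagged corpora are concatenated with the remaining data; at decoding time, the in-domain tag can be prepended to test inputs to bias the system toward the news domain.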
