Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation

Neural machine translation (NMT) systems are usually trained on clean parallel data and can translate clean in-domain texts very well. However, as previous work has demonstrated, translation quality degrades significantly on noisy texts, such as user-generated texts (UGT) from online social media. Given the lack of UGT parallel data for training or adapting NMT systems, we synthesize UGT parallel data by exploiting monolingual UGT data through cross-lingual language model pre-training and zero-shot NMT. This paper presents two different but complementary approaches: one alters given clean parallel data into UGT-like parallel data, whereas the other generates translations directly from monolingual UGT data. On the MTNT translation tasks, we show that our synthesized parallel data yields better NMT systems for UGT while making them more robust when translating texts from various domains and styles.
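
To make the second approach concrete, the sketch below back-translates monolingual UGT into a synthetic source side, so that each resulting pair keeps genuine UGT on the target side. This is a minimal illustration of the data-synthesis idea under stated assumptions, not the authors' pipeline: the HuggingFace checkpoint Helsinki-NLP/opus-mt-fr-en is an off-the-shelf supervised model standing in for the paper's zero-shot NMT systems built with cross-lingual language model pre-training, and the example sentences are invented.

```python
# Minimal sketch of synthesizing UGT parallel data via back-translation.
# NOTE: the model checkpoint and example sentences are illustrative
# assumptions; the paper itself uses zero-shot NMT systems obtained
# through cross-lingual language model pre-training, not this model.
from transformers import pipeline

# Off-the-shelf French-to-English translation model (stand-in only).
backtranslate = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

# Monolingual UGT on the target side: noisy French social-media text
# (invented examples in the style of MTNT's Reddit data).
ugt_sentences = [
    "jsp pourquoi mais ce resto est trop bien mdr",
    "g vu le film hier soir, franchement bof...",
]

# Back-translate each UGT sentence to obtain a synthetic source side.
# The (synthetic English, original French UGT) pairs can then serve as
# training data whose target side retains genuine UGT characteristics.
synthetic_pairs = [
    (backtranslate(s)[0]["translation_text"], s) for s in ugt_sentences
]

for src, tgt in synthetic_pairs:
    print(f"{src}\t{tgt}")
```

Because only the synthetic side passes through the translation model, the noise, slang, and style of the original UGT survive on the target side of every pair, which is what makes such data useful for adapting an NMT system to UGT.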
