The Source-Target Domain Mismatch Problem in Machine Translation

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures, and many of the events we experience in everyday life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that it causes the domains of the source and target languages to greatly mismatch. We first formalize the concept of source-target domain mismatch, propose a metric to quantify it, and provide empirical evidence of its existence. We conclude with an empirical study of how source-target domain mismatch affects the training of machine translation systems for low-resource languages. While this mismatch can severely degrade back-translation, the degradation can be alleviated by combining back-translation with self-training and by increasing the amount of target-side monolingual data.
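The data-augmentation recipe the abstract alludes to can be sketched as follows. This is a minimal illustration, not the paper's implementation: back-translation pairs real target-side monolingual sentences with synthetic sources produced by a target-to-source model, while self-training pairs real source-side sentences with synthetic targets from a source-to-target model; the two pools of synthetic pairs are then mixed with the real parallel data. The `backward_model` and `forward_model` stubs stand in for trained NMT systems and are purely hypothetical placeholders.

```python
# Sketch of combining back-translation with self-training.
# The two "models" below are toy stubs standing in for trained
# target->source and source->target NMT systems (assumption).

def backward_model(tgt_sentence: str) -> str:
    # stub target->source translator used for back-translation
    return "src(" + tgt_sentence + ")"

def forward_model(src_sentence: str) -> str:
    # stub source->target translator used for self-training
    return "tgt(" + src_sentence + ")"

def back_translate(tgt_mono):
    # each REAL target sentence gets a SYNTHETIC source side
    return [(backward_model(t), t) for t in tgt_mono]

def self_train(src_mono):
    # each REAL source sentence gets a SYNTHETIC target side
    return [(s, forward_model(s)) for s in src_mono]

def build_training_data(parallel, src_mono, tgt_mono):
    # mix real parallel pairs with both kinds of synthetic pairs;
    # increasing tgt_mono grows the back-translated portion, which the
    # abstract reports helps under source-target domain mismatch
    return parallel + back_translate(tgt_mono) + self_train(src_mono)

data = build_training_data(
    parallel=[("hello", "bonjour")],
    src_mono=["good morning"],
    tgt_mono=["bonsoir"],
)
# data now holds one real pair, one back-translated pair,
# and one self-trained pair
```

Under source-target domain mismatch, the target monolingual corpus covers topics absent from the source side, so the back-translated synthetic sources drift out of the true source domain; mixing in self-trained pairs keeps genuine source-domain sentences in the training mix.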
