Boosting Neural Machine Translation with Similar Translations

This paper explores data augmentation methods for training Neural Machine Translation to make use of similar translations, much as a human translator employs fuzzy matches. In particular, we show how we can simply feed the neural model with information on both the source and target sides of the fuzzy matches. We also extend the notion of similarity to include semantically related translations retrieved using distributed sentence representations. We show that translations based on fuzzy matching provide the model with "copy" information, while translations based on embedding similarities tend to extend the translation "context". Results indicate that the effects of both kinds of similar sentences add up to further boost accuracy, combine naturally with model fine-tuning, and provide dynamic adaptation for unseen translation pairs. Tests on multiple data sets and domains show consistent accuracy improvements. To foster research around these techniques, we also release an open-source toolkit with an efficient and flexible fuzzy-match implementation.
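The augmentation idea can be illustrated with a minimal sketch. It retrieves the closest translation-memory entry by token-level edit distance and appends that entry's target side to the source sentence before it is passed to the model. The abstract does not specify the exact input format, so the `<sep>` separator, the 0.6 threshold, the length-normalized score, and all function names below are assumptions made for illustration only; this is not the released toolkit's API.

```python
from typing import List, Optional, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_score(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def best_fuzzy_match(
    source: str,
    memory: List[Tuple[str, str]],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Optional[Tuple[float, str, str]]:
    """Return (score, tm_source, tm_target) of the best match, or None."""
    src_tokens = source.split()
    best: Optional[Tuple[float, str, str]] = None
    for tm_src, tm_tgt in memory:
        score = fuzzy_score(src_tokens, tm_src.split())
        if score >= threshold and (best is None or score > best[0]):
            best = (score, tm_src, tm_tgt)
    return best


def augment_source(
    source: str,
    memory: List[Tuple[str, str]],
    sep: str = "<sep>",      # hypothetical separator token
    threshold: float = 0.6,
) -> str:
    """Append the target side of the best fuzzy match to the source."""
    match = best_fuzzy_match(source, memory, threshold)
    if match is None:
        return source  # no sufficiently similar translation: leave input unchanged
    return f"{source} {sep} {match[2]}"


memory = [
    ("how long does the flight last ?", "combien de temps dure le vol ?"),
    ("where is the station ?", "où est la gare ?"),
]
augmented = augment_source("how long does the trip last ?", memory)
```

In this toy example the first memory entry differs from the input by a single token, so its French target is appended and the model can copy most of it. A linear scan like this is only practical for small memories; retrieval based on sentence embeddings would replace `fuzzy_score` with a vector similarity.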
