Paper Information - Synthetic Source Language Augmentation for Colloquial Neural Machine Translation

Neural machine translation (NMT) is typically domain-dependent and style-dependent, and it requires large amounts of training data. State-of-the-art NMT models often fall short in handling colloquial variations of their source language, and the lack of parallel data in this regard is a challenging hurdle in systematically improving the existing models. In this work, we develop a novel colloquial Indonesian-English test set collected from YouTube transcripts and Twitter. We perform synthetic style augmentation on the formal Indonesian source language and show that it improves the baseline Id-En models (in BLEU) on the new test data.

Alham Fikri Aji | Radityo Eko Prasojo | Asrul Sani Ariesandy | Mukhlis Amien
