Mixed-Level Neural Machine Translation

Building the first Russian-Vietnamese neural machine translation system, we faced the problem of choosing the translation unit system on which source and target embeddings are based. Existing homogeneous translation unit systems, which use the same translation unit on both the source and target sides, do not suit this language pair well. To solve the problem, in this paper we propose a novel heterogeneous translation unit system that accounts for the linguistic characteristics of the synthetic Russian language and the analytic Vietnamese language. Specifically, we decrease the embedding level on the source side by splitting tokens into subtokens, and increase the embedding level on the target side by merging neighboring tokens into supertokens. Experimental results show that the proposed heterogeneous system improves over the best existing homogeneous Russian-Vietnamese translation system by 1.17 BLEU. Our approach could be applied to building translation bots for language pairs with differing linguistic characteristics.
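The two operations described above can be sketched in code. The following is a minimal, illustrative Python sketch (not the authors' implementation): a greedy longest-match subword splitter for the Russian source side, and a dictionary-based merger that joins neighboring Vietnamese syllables into a supertoken. The subword and compound vocabularies are toy examples invented for demonstration; real systems would learn them from data (e.g. via BPE and a word segmenter).

```python
def split_subwords(token, subword_vocab):
    """Greedy longest-match split of a token into known subtokens.
    Unmatched remainders fall back to single characters. Non-final
    pieces get the conventional '@@' continuation marker."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            piece = token[i:j]
            if piece in subword_vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]

def merge_supertokens(tokens, compound_vocab):
    """Join neighboring syllables that form a known Vietnamese
    compound word into one supertoken (joined with '_')."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compound_vocab:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy vocabularies (hypothetical entries, for illustration only).
subwords = {"пере", "вод", "чик"}
compounds = {("máy", "tính"), ("sinh", "viên")}

print(split_subwords("переводчик", subwords))                # ['пере@@', 'вод@@', 'чик']
print(merge_supertokens(["máy", "tính", "mới"], compounds))  # ['máy_tính', 'mới']
```

The sketch lowers the source embedding level (one Russian word becomes several subtokens) and raises the target embedding level (several Vietnamese syllables become one supertoken), which is the essence of the proposed heterogeneous system.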
