Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those obtained for high-resource languages. When data are scarce, it is crucial to make the best possible use of the limited material available. To that end, we propose using the same parallel sentences multiple times, changing only the way the words are split into subwords each time. For this purpose we use several Byte Pair Encoding (BPE) models, each configured with a different number of merge operations. In our experiments, we use this technique to expand the available data and improve an MT system for a low-resource language pair, namely English-Esperanto. As an additional contribution, we make available a set of English-Esperanto parallel data in the literary domain.
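The idea of reusing the same sentence pairs under several subword segmentations can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy BPE learner, the function names (`learn_bpe`, `merge_pair`, `segment`, `augment`), the `@@` continuation marker, the example sentences and merge counts, and the choice to learn separate BPE models per language are all assumptions made for illustration.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (an assumption of this sketch)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of sentences."""
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + (END,)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(sentence, merges):
    """Split a sentence into subwords by applying the merge rules in learned order."""
    pieces = []
    for word in sentence.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        symbols = list(symbols)
        symbols[-1] = symbols[-1][: -len(END)]  # strip the end-of-word marker
        if not symbols[-1]:
            symbols.pop()  # the marker was never merged into the last subword
        pieces.extend([s + "@@" for s in symbols[:-1]] + [symbols[-1]])
    return " ".join(pieces)

def augment(src, tgt, merge_counts):
    """Concatenate one copy of the parallel corpus per BPE configuration."""
    out_src, out_tgt = [], []
    for n in merge_counts:
        src_merges = learn_bpe(src, n)  # separate models per language: an assumption
        tgt_merges = learn_bpe(tgt, n)
        out_src += [segment(s, src_merges) for s in src]
        out_tgt += [segment(t, tgt_merges) for t in tgt]
    return out_src, out_tgt

# Toy English-Esperanto pairs and merge counts, chosen only for illustration.
en = ["the house is old", "the old house fell"]
eo = ["la domo estas malnova", "la malnova domo falis"]
aug_en, aug_eo = augment(en, eo, merge_counts=(0, 5, 20))
for s, t in zip(aug_en, aug_eo):
    print(s, "|||", t)
```

Each merge count yields a differently segmented copy of the same sentence pairs, so the training set grows by a factor equal to the number of BPE configurations while the underlying text stays unchanged. Removing the `@@ ` markers from any copy restores the original sentence.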
