Multiple Segmentations of Thai Sentences for Neural Machine Translation

Thai is a low-resource language, so there is often too little data available to train a Neural Machine Translation (NMT) model that performs to a high level of quality. In addition, the Thai script does not use white space to delimit word boundaries, which adds complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English-Thai parallel data by replicating sentence pairs with different word segmentations of the Thai side and using the result as training data for NMT models. Different segmentations of Thai sentences are obtained by varying the number of merge operations in Byte Pair Encoding (BPE). The experiments show that combining these datasets improves the performance of NMT models trained with a dataset that has been split using a supervised splitting tool.
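The augmentation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentence pairs, the function names (`learn_bpe`, `merge_pair`, `segment`, `augment`), the merge counts, and the `@@ ` subword separator are all assumptions made for illustration. The Thai side is assumed to have been pre-split into words by some supervised tool, and each English sentence is then replicated once per BPE merge setting.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Greedily learn up to `num_merges` BPE merge operations from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment one word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

def augment(pairs, merges, merge_counts):
    """Replicate each sentence pair once per merge count, re-segmenting the Thai side."""
    augmented = []
    for n in merge_counts:
        for english, thai_words in pairs:
            thai = " ".join("@@ ".join(segment(w, merges[:n])) for w in thai_words)
            augmented.append((english, thai))
    return augmented

# Toy parallel data (hypothetical); the Thai side is already split into words.
pairs = [
    ("I eat rice", ["ฉัน", "กิน", "ข้าว"]),
    ("I drink water", ["ฉัน", "ดื่ม", "น้ำ"]),
]
all_words = [w for _, thai_words in pairs for w in thai_words]
merges = learn_bpe(all_words, num_merges=10)

# Three copies of the corpus: character-level, lightly merged, and more heavily merged.
augmented = augment(pairs, merges, merge_counts=[0, 2, 5])
for english, thai in augmented:
    print(english, "|||", thai)
```

Because the merges are learned greedily, the first `n` entries of `merges` are the same operations that learning with only `n` merges would produce, so a single learning pass can serve every merge setting in this sketch.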
