The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation

This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, hypothesizing that morphologically based segmentation would lead to better performance in this setting. However, we find no consistent and reliable differences between these methods and BPE. While morphologically based methods outperform BPE in a few cases, the best-performing method varies across tasks, and the performance of the segmentation methods is often statistically indistinguishable.
