Meaningless yet meaningful: Morphology grounded subword-level NMT

We explore the use of two independent subsystems, namely Byte Pair Encoding (BPE) and Morfessor as basic units for subword-level neural machine translation (NMT). We have shown that for linguistically distant language-pairs Morfessor-based segmentation algorithm produces significantly better quality translation than BPE. However, for close language-pairs BPE-based subword-NMT may translate better than Morfessor-based subword-NMT. We have proposed a combined approach of these two segmentation algorithms Morfessor-BPE (M-BPE) which outperforms these two baseline systems in terms of BLEU score. Our results are supported by experiments on three language-pairs: English-Hindi, BengaliHindi and English-Bengali.

[1]  Rico Sennrich,et al.  Nematus: a Toolkit for Neural Machine Translation , 2017, EACL.

[2]  Mikko Kurimo,et al.  Morfessor FlatCat: An HMM-Based Method for Unsupervised and Semi-Supervised Learning of Morphology , 2014, COLING.

[3]  Pushpak Bhattacharyya,et al.  Orthographic Syllable as basic unit for SMT between Related Languages , 2016, EMNLP.

[4]  Mikko Kurimo,et al.  Morfessor 2.0: Toolkit for statistical morphological segmentation , 2014, EACL.

[5]  S. Arikawa,et al.  Byte Pair Encoding: a Text Compression Scheme That Accelerates Pattern Matching , 1999 .

[6]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[7]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[8]  John DeNero,et al.  Variable-Length Word Encodings for Neural Translation Models , 2015, EMNLP.

[9]  Pushpak Bhattacharyya,et al.  Shata-Anuvadak: Tackling Multiway Translation of Indian Languages , 2014, LREC.

[10]  Philip Gage,et al.  A new algorithm for data compression , 1994 .

[11]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[12]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[13]  Mathias Creutz,et al.  Morfessor in the Morpho Challenge , 2006 .

[14]  Girish Nath Jha The TDIL Program and the Indian Langauge Corpora Intitiative (ILCI) , 2010, LREC.

[15]  Pushpak Bhattacharyya,et al.  Learning variable length units for SMT between related languages via Byte Pair Encoding , 2016, SWCN@EMNLP.

[16]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[17]  José A. R. Fonollosa,et al.  Character-based Neural Machine Translation , 2016, ACL.

[18]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[19]  Yoshua Bengio,et al.  A Character-level Decoder without Explicit Segmentation for Neural Machine Translation , 2016, ACL.