A Statistical Extension of Byte-Pair Encoding

Sub-word segmentation is currently a standard preprocessing step for neural machine translation (MT) and other NLP tasks. The goal is to split words (in both the source and target languages) into smaller units, which then constitute the input and output vocabularies of the MT system. Reducing the size of these vocabularies increases the generalization capability of the translation model, enabling the system to translate and generate infrequent and unseen words at inference time by combining previously seen sub-word units. Ideally, the resulting units would carry some linguistic meaning, so that words are built compositionally. However, the most popular word-splitting method, Byte-Pair Encoding (BPE), which originates in the data-compression literature, includes no explicit criterion either to favor linguistic splits or to find the optimal sub-word granularity for the given training data. In this paper, we propose a statistically motivated extension of the BPE algorithm and an effective convergence criterion that avoids the costly experimentation cycle otherwise needed to select the best sub-word vocabulary size. Experimental results on morphologically rich languages show that our model achieves near-optimal BLEU scores and produces morphologically better word segmentations, allowing it to outperform BPE, as measured by human evaluation, when translating sentences that contain new words.
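For readers unfamiliar with the baseline being extended, the following is a minimal Python sketch of the standard greedy BPE training loop (frequency-based pair merging over a word-frequency dictionary), not the statistical extension proposed in this paper; the function name and toy data are illustrative.

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Learn BPE merges from a word-frequency dictionary.

    corpus_words: dict mapping a word (str) to its corpus frequency.
    Returns the ordered list of learned merge pairs.
    """
    # Represent each word as a tuple of symbols (initially single characters).
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedy criterion of standard BPE: pick the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the chosen pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Example: learn 10 merges from a toy frequency dictionary.
print(bpe_train({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10))
```

As the abstract describes, a statistical extension has two natural attachment points in this loop: the pair-scoring step (replacing raw co-occurrence frequency with a statistical association measure) and the stopping condition (replacing the fixed merge count `num_merges` with a data-driven convergence criterion, so the vocabulary size need not be tuned by repeated training runs).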
