Investigating the Effectiveness of BPE: The Power of Shorter Sequences

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness has made it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary-size budget, the fewer tokens an algorithm needs to cover the test set, the better the translation quality (as measured by BLEU).

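For readers unfamiliar with the algorithm, the following minimal Python sketch illustrates the greedy merge procedure at the heart of BPE: starting from characters, the most frequent adjacent symbol pair is repeatedly replaced by a new merged symbol, so each merge grows the sub-word vocabulary by one entry. The toy corpus and the `num_merges` parameter are illustrative only and do not reflect the paper's experimental setup.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a list of words (a minimal sketch)."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with the merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Example: the frequent pair ('l', 'o') is merged early.
print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```

Viewed through the compression lens taken in the paper, each learned merge is a dictionary entry, and applying the merges to a text shortens the token sequence needed to cover it.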