Investigating the Effectiveness of BPE: The Power of Shorter Sequences

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness has made it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary-size budget, the fewer tokens an algorithm needs to cover the test set, the better the translation quality (as measured by BLEU).

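For readers unfamiliar with the algorithm, the following minimal Python sketch illustrates the greedy merge procedure at the heart of BPE: starting from characters, the most frequent adjacent symbol pair is repeatedly replaced by a new merged symbol, so each merge grows the sub-word vocabulary by one entry. The toy corpus and the `num_merges` parameter are illustrative only and do not reflect the paper's experimental setup.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a list of words (a minimal sketch)."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with the merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Example: the frequent pair ('l', 'o') is merged early.
print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```

Viewed through the compression lens taken in the paper, each learned merge is a dictionary entry, and applying the merges to a text shortens the token sequence needed to cover it.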