Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is character-level or byte-level processing the end of the road? In this survey, we connect several lines of work from the pre-neural and neural eras, showing how hybrid approaches that combine words and characters, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated. We conclude that there is no silver-bullet solution for all applications, and likely never will be, and that thinking seriously about tokenization remains important.
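Since BPE is central to the story above, a minimal sketch of its merge-learning loop may help fix ideas. The snippet below is our own illustrative Python reconstruction of the greedy algorithm in the spirit of Sennrich et al.'s subword approach, not a reference implementation: the `learn_bpe` name and the toy corpus are assumptions, and the end-of-word marker used in practice is omitted for brevity.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary.

    `corpus` maps whitespace-tokenized words to their counts; each word
    starts out as a tuple of single characters.
    """
    vocab = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

# Toy usage: the most frequent adjacent pair is merged first.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 4))
```

Each iteration greedily merges the most frequent adjacent symbol pair; applying the resulting merge list in order is what segments an unseen word into subword units at inference time, which is why the learned vocabulary can stay small while remaining open-vocabulary.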
