ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
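
Since the abstract's central claim is that a standard Transformer can consume raw UTF-8 bytes in place of learned subword tokens, the sketch below illustrates what the input pipeline reduces to in that setting. It is a minimal illustration rather than the released ByT5 code: the choice of three reserved special-token IDs (pad/EOS/UNK), the byte offset, and the `encode`/`decode` helper names are assumptions made for this example.

```python
# Minimal sketch of byte-level "tokenization": the vocabulary is just the
# 256 possible byte values plus a few reserved IDs, so there is no learned
# tokenizer to train or ship. The offset of 3 (reserving 0/1/2 for
# pad/EOS/UNK) is an assumption for this sketch, not a requirement.

PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
NUM_SPECIAL_TOKENS = 3  # byte values are shifted up by this amount


def encode(text: str) -> list[int]:
    """Map a string to a sequence of integer IDs, one per UTF-8 byte."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Invert encode(), skipping special IDs and any invalid byte sequences."""
    byte_values = bytes(i - NUM_SPECIAL_TOKENS for i in ids if i >= NUM_SPECIAL_TOKENS)
    return byte_values.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = encode("héllo")  # works for any language, no vocabulary needed
    print(ids)             # [107, 198, 172, 111, 111, 114, 1]
    print(decode(ids))     # héllo
```

Because the vocabulary is fixed and tiny, the resulting sequences are several times longer than subword sequences, which is exactly the compute trade-off the abstract characterizes in terms of parameter count, training FLOPs, and inference speed.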
