ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
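
Since the abstract's central claim is that a standard Transformer can consume raw UTF-8 bytes in place of learned subword tokens, the sketch below illustrates what the input pipeline reduces to in that setting. It is a minimal illustration rather than the released ByT5 code: the choice of three reserved special-token IDs (pad/EOS/UNK), the byte offset, and the `encode`/`decode` helper names are assumptions made for this example.

```python
# Minimal sketch of byte-level "tokenization": the vocabulary is just the
# 256 possible byte values plus a few reserved IDs, so there is no learned
# tokenizer to train or ship. The offset of 3 (reserving 0/1/2 for
# pad/EOS/UNK) is an assumption for this sketch, not a requirement.

PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
NUM_SPECIAL_TOKENS = 3  # byte values are shifted up by this amount


def encode(text: str) -> list[int]:
    """Map a string to a sequence of integer IDs, one per UTF-8 byte."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Invert encode(), skipping special IDs and any invalid byte sequences."""
    byte_values = bytes(i - NUM_SPECIAL_TOKENS for i in ids if i >= NUM_SPECIAL_TOKENS)
    return byte_values.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = encode("héllo")  # works for any language, no vocabulary needed
    print(ids)             # [107, 198, 172, 111, 111, 114, 1]
    print(decode(ids))     # héllo
```

Because the vocabulary is fixed and tiny, the resulting sequences are several times longer than subword sequences, which is exactly the compute trade-off the abstract characterizes in terms of parameter count, training FLOPs, and inference speed.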
