Wine is Not v i n. On the Compatibility of Tokenizations Across Languages

The size of the vocabulary is a central design choice in large pretrained language models, affecting both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" tokenized at the word level in English vs. "v i n" tokenized at the character level in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible, a desideratum that so far has been neglected in multilingual models.
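To make the incompatibility problem concrete, the sketch below compares how finely two tokenizers split aligned words in two languages: if one side ends up character-level ("v i n") while the other stays word-level ("wine"), a simple granularity ratio deviates strongly from 1. This is a hypothetical proxy for illustration only, not the compatibility measure proposed in the paper; the toy tokenizers and word pairs are assumptions.

```python
# Hypothetical granularity-based proxy (NOT the paper's measure):
# compare subword counts of aligned words under two tokenizers.

from statistics import mean

def en_tokenize(word):          # assumed toy English tokenizer (word-level)
    return [word]

def fr_tokenize(word):          # assumed toy French tokenizer (character-level)
    return list(word)

# Hypothetical aligned (English, French) word pairs.
aligned_pairs = [("wine", "vin"), ("house", "maison"), ("water", "eau")]

def granularity_ratio(pairs, tok_a, tok_b):
    """Average symmetric ratio of subword counts for aligned words.

    1.0 means both tokenizers split aligned words into the same number of
    pieces; values near 0 signal very different granularities.
    """
    ratios = []
    for a, b in pairs:
        na, nb = len(tok_a(a)), len(tok_b(b))
        ratios.append(min(na, nb) / max(na, nb))
    return mean(ratios)

if __name__ == "__main__":
    score = granularity_ratio(aligned_pairs, en_tokenize, fr_tokenize)
    print(f"compatibility proxy: {score:.2f}")  # low score -> incompatible granularities
```

Under this toy setup, "wine" yields one token while "v i n" yields three, so the pair contributes a ratio of 1/3, flagging the mismatch the abstract describes.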
