Paper information - Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

In multilingual language model pre-training, vocabulary size remains an unsolved problem for languages with a broad set of potential characters. We propose two algorithms applicable to any unsupervised multilingual pre-training task. They increase the elasticity of the budget required to build the vocabulary in Byte-Pair Encoding-inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.
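To make the "subcharacter representation" in the title concrete, the following is a minimal sketch of decomposing precomposed Hangul syllables into jamo using standard Unicode normalization. It is not the paper's two algorithms, which the abstract does not spell out; the helper names `to_jamo` and `from_jamo` are hypothetical and used only for illustration.

```python
import unicodedata

# Illustrative helpers (hypothetical names, not taken from the paper).
def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo (Unicode NFD)."""
    return unicodedata.normalize("NFD", text)

def from_jamo(text: str) -> str:
    """Recompose conjoining jamo into precomposed syllables (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)

word = "한국어"
jamo = to_jamo(word)

# Each syllable splits into a leading consonant, a vowel, and an optional
# trailing consonant, so the sequence gets longer but uses far fewer symbols.
print(len(word), len(jamo))
print(from_jamo(jamo) == word)

# The Unicode Hangul Syllables block holds 19 * 21 * 28 = 11,172 precomposed
# syllables (U+AC00..U+D7A3); collect the distinct jamo they decompose into.
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A3 + 1)]
alphabet = {j for s in syllables for j in to_jamo(s)}
print(len(syllables), len(alphabet))
```

The trade-off this illustrates is the one the abstract points at: a jamo-level alphabet is tiny compared with the syllable inventory, at the cost of longer input sequences that a BPE-style merge procedure can then shorten again.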

Sangwhan Moon | Naoaki Okazaki
