Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

In the context of multilingual language model pre-training, choosing a vocabulary size for languages with a large inventory of potential characters remains an unsolved problem. We propose two algorithms, applicable to any unsupervised multilingual pre-training task, that make the vocabulary budget of Byte-Pair-Encoding-inspired tokenizers more elastic and significantly reduce the cost of supporting Korean in a multilingual model.
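The mechanism that makes this compression possible is that every precomposed Hangul syllable decomposes deterministically into two or three jamo subcharacters, shrinking the base character inventory from 11,172 syllables to a few dozen symbols before any BPE merges are learned. The abstract above includes no code, so what follows is only a minimal sketch of that decomposition step in Python: the Unicode arithmetic is standard, but the function name, the use of compatibility jamo, and the toy example are illustrative assumptions, not the authors' implementation.

    # Hangul syllables (U+AC00..U+D7A3) are composed algorithmically:
    #   code = 0xAC00 + (initial * 21 + medial) * 28 + final
    # Inverting this arithmetic splits each syllable into its jamo.
    SBASE, LCOUNT, VCOUNT, TCOUNT = 0xAC00, 19, 21, 28
    NCOUNT = VCOUNT * TCOUNT   # 588 syllables per initial consonant
    SCOUNT = LCOUNT * NCOUNT   # 11,172 precomposed syllables in total

    # Compatibility jamo, chosen here for readability; the paper's exact
    # subcharacter alphabet may differ.
    CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
    JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
    JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

    def decompose(text: str) -> str:
        """Split every precomposed Hangul syllable into its jamo."""
        out = []
        for ch in text:
            idx = ord(ch) - SBASE
            if 0 <= idx < SCOUNT:
                l, rem = divmod(idx, NCOUNT)
                v, t = divmod(rem, TCOUNT)
                out.append(CHOSEONG[l] + JUNGSEONG[v] + JONGSEONG[t])
            else:
                out.append(ch)  # pass non-Hangul characters through unchanged
        return "".join(out)

    print(decompose("한국어"))  # -> ㅎㅏㄴㄱㅜㄱㅇㅓ

A tokenizer pipeline along these lines would run the decomposition over the corpus before training a BPE or SentencePiece model, and invert the same arithmetic at detokenization time to recover the original syllables.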
