Investigating an Effective Character-level Embedding in Korean Sentence Classification

Unlike the writing systems of many Romance and Germanic languages, some languages or language families exhibit complex conjunct forms in character composition. In such cases, where a conjunct consists of components representing consonant(s) and a vowel, various character encoding schemes can be adopted beyond merely constructing a one-hot vector. However, little work has been done on intra-language comparison of the performance of each representation. In this study, using Korean, a character-rich and agglutinative language, we investigate which encoding scheme is most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one for binary sentiment analysis of movie reviews, and the other for multi-class identification of intention types. The results show that character-level features yield higher performance in general, although Jamo-level features can be competitive with attention-based models when an adequate parameter budget is guaranteed.
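To make the four schemes concrete, below is a minimal Python sketch, not the paper's code: the function names and the NumPy layout are illustrative assumptions, while the index arithmetic follows the standard Unicode Hangul syllable block (precomposed syllables U+AC00..U+D7A3 factor into 19 lead consonants x 21 vowels x 28 tails, where tail 0 means no final consonant).

```python
# Illustrative sketch (not the authors' implementation) of the character
# encoding schemes compared in the paper, shown on one Hangul syllable.
import numpy as np

N_LEAD, N_VOWEL, N_TAIL = 19, 21, 28  # Unicode Hangul block dimensions

def decompose(syllable: str):
    """Return (lead, vowel, tail) Jamo indices of a precomposed syllable."""
    idx = ord(syllable) - 0xAC00
    assert 0 <= idx <= 0xD7A3 - 0xAC00, "not a precomposed Hangul syllable"
    return idx // (N_VOWEL * N_TAIL), (idx // N_TAIL) % N_VOWEL, idx % N_TAIL

def jamo_one_hot(syllable: str) -> np.ndarray:
    """Jamo-level one-hot: a sequence of three one-hot vectors per syllable."""
    lead, vowel, tail = decompose(syllable)
    seq = np.zeros((3, max(N_LEAD, N_VOWEL, N_TAIL)))
    seq[0, lead] = seq[1, vowel] = seq[2, tail] = 1.0
    return seq

def char_multi_hot(syllable: str) -> np.ndarray:
    """Character-level multi-hot: lead/vowel/tail one-hots concatenated into
    a single 68-dim vector with up to three active bits."""
    lead, vowel, tail = decompose(syllable)
    vec = np.zeros(N_LEAD + N_VOWEL + N_TAIL)
    vec[lead] = vec[N_LEAD + vowel] = vec[N_LEAD + N_VOWEL + tail] = 1.0
    return vec

# Character-level one-hot would instead index the whole syllable in a
# vocabulary of observed characters; character-level dense would look that
# index up in a trained embedding matrix (e.g., an nn.Embedding layer).
print(char_multi_hot("한").nonzero())  # active bits for lead ㅎ, vowel ㅏ, tail ㄴ
```

Under this layout, the Jamo-level input is three times longer than the character-level input but draws on a far smaller symbol inventory, which is the trade-off the comparison in the study turns on.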
