Paper information - Sub-character Neural Language Modelling in Japanese

In East Asian languages such as Japanese and Chinese, the semantics of a character are partly reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modelling in Japanese. We decompose characters according to a range of character decomposition datasets and train a neural language model over the resulting representations at varying levels of decomposition. Our results indicate that including sub-characters can improve language modelling, though the improvement depends on a good choice of decomposition dataset and an appropriate granularity of decomposition.
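To make the idea of decomposing characters at different granularities concrete, here is a minimal Python sketch. It is not the paper's implementation: the table `TOY_DECOMP`, the function names `decompose` and `to_subchar_sequence`, and the `depth` parameter are all hypothetical, and the three table entries are hand-written for illustration rather than taken from any of the decomposition datasets the paper evaluates. How the paper actually feeds sub-characters to the language model is not stated in the abstract, so replacing each character by its component sequence is only one possible choice.

```python
# Toy decomposition table, hand-written for illustration only.
# A real experiment would load entries from a character decomposition dataset.
TOY_DECOMP = {
    "語": ["言", "吾"],
    "吾": ["五", "口"],
    "読": ["言", "売"],
}


def decompose(char, table, depth):
    """Expand `char` into sub-characters, recursing at most `depth` levels.

    depth=0 returns the character unchanged; larger values give a
    finer-grained decomposition. Characters missing from `table` are
    kept as they are.
    """
    if depth <= 0 or char not in table:
        return [char]
    parts = []
    for component in table[char]:
        parts.extend(decompose(component, table, depth - 1))
    return parts


def to_subchar_sequence(text, table, depth):
    """Map a string to the token sequence a language model would be trained on."""
    sequence = []
    for char in text:
        sequence.extend(decompose(char, table, depth))
    return sequence


if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, to_subchar_sequence("日本語", TOY_DECOMP, depth))
```

With this toy table, `depth=0` leaves 日本語 as three character tokens, `depth=1` splits 語 into 言 and 吾, and `depth=2` further splits 吾 into 五 and 口. Varying `depth` and swapping the table are the two knobs that correspond to the abstract's "granularity of decomposition" and "choice of decomposition dataset".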
