Sub-character Neural Language Modelling in Japanese

In East Asian languages such as Japanese and Chinese, the semantics of a character are partly reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modelling in Japanese. We decompose characters according to a range of character decomposition datasets and train a neural language model over the variously decomposed character representations. Our results indicate that including sub-characters can improve language modelling, though the improvement depends on a good choice of decomposition dataset and an appropriate granularity of decomposition.
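
To make the notion of decomposition granularity concrete, the following is a minimal sketch of how a character sequence might be expanded into sub-character tokens before being fed to a language model. The table `DECOMP`, the functions `decompose` and `decompose_text`, and the `depth` parameter are all hypothetical names chosen for illustration; the three table entries are toy examples and are not drawn from any of the decomposition datasets used in the paper.

```python
# Toy decomposition table (illustrative only; a real dataset would be far larger
# and may decompose the same characters differently).
DECOMP = {
    "明": ["日", "月"],
    "語": ["言", "吾"],
    "吾": ["五", "口"],
}


def decompose(char, depth):
    """Expand `char` into components, recursing at most `depth` levels.

    depth=0 leaves the character intact; larger values give finer granularity.
    Characters without a table entry are returned unchanged.
    """
    if depth <= 0 or char not in DECOMP:
        return [char]
    parts = []
    for component in DECOMP[char]:
        parts.extend(decompose(component, depth - 1))
    return parts


def decompose_text(text, depth):
    """Decompose every character of `text` and return a flat token list."""
    tokens = []
    for char in text:
        tokens.extend(decompose(char, depth))
    return tokens


if __name__ == "__main__":
    for depth in range(3):
        print(depth, decompose_text("日本語", depth))
```

Under this sketch, the choice of table stands in for the choice of decomposition dataset and `depth` stands in for the granularity of decomposition; the resulting token sequence would then be the input to the language model.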
