Subcharacter Information in Japanese Embeddings: When Is It Worth It?

Languages with logographic writing systems present a difficulty for traditional character-level models. Leveraging subcharacter information was recently shown to be beneficial for a number of intrinsic and extrinsic tasks in Chinese. We examine whether the same strategies can be applied to Japanese, and contribute a new analogy dataset for this language.
