nwjc2vec: Word Embedding Data Constructed from NINJAL Web Japanese Corpus

We constructed word embedding data (named as ’nwjc2vec’) using the NINJAL Web Japanese Corpus and word2vec software, and released it publicly. In this report, nwjc2vec is introduced, and the result of two types of experiments that were conducted to evaluate the quality of nwjc2vec is shown. In the first experiment, the evaluation based on word similarity is considered. Using a word similarity dataset, we calculate Spearman’s rank correlation coefficient. In the second experiment, the evaluation based on task is considered. As the task, we consider word sense disambiguation (WSD) and language model construction using Recurrent Neural Network (RNN). The results obtained using the nwjc2vec were compared with the results obtained using word embedding constructed from the article data of newspaper for seven years. The nwjc2vec is shown to be high quality.