Paper information - NWJC2Vec: Word embedding dataset from ‘NINJAL Web Japanese Corpus’

In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus that contains 25.8 billion tokens. We construct two versions of the dataset: one based on the surface form, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
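As a rough illustration of what comparing embeddings with a thesaurus can look like, the sketch below checks whether words that share a thesaurus category are closer in vector space than words that do not. It is not the evaluation procedure of the paper, which the abstract does not specify. The vectors, the category labels, and the helper names `cosine` and `mean_similarity` are all hypothetical and chosen only for the example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 3-dimensional vectors (made up; real embeddings have far more dimensions).
vectors = {
    "犬": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "走る": [0.1, 0.9, 0.2],
    "歩く": [0.0, 0.8, 0.3],
}

# Hypothetical category labels standing in for thesaurus categories.
categories = {"犬": "animal", "猫": "animal", "走る": "motion", "歩く": "motion"}


def mean_similarity(vectors, categories):
    """Return (mean similarity of same-category pairs, mean of cross-category pairs)."""
    same, diff = [], []
    for w1, w2 in combinations(sorted(vectors), 2):
        sim = cosine(vectors[w1], vectors[w2])
        (same if categories[w1] == categories[w2] else diff).append(sim)
    return sum(same) / len(same), sum(diff) / len(diff)


same_mean, diff_mean = mean_similarity(vectors, categories)
print(f"same-category mean: {same_mean:.3f}, cross-category mean: {diff_mean:.3f}")
```

If the embeddings reflect the thesaurus at all, the same-category mean should exceed the cross-category mean, as it does for these toy vectors.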

Masayuki Asahara
