A Study of Chinese Document Representation and Classification with Word2vec

Word2vec is a neural network language model that converts words and phrases into high-quality distributed vectors (called word embeddings) that capture semantic relationships between words, so it offers a unique perspective on text classification and other natural language processing (NLP) tasks. In this paper, we propose combining an improved tf-idf algorithm with word embeddings to represent documents, and we conduct text classification experiments on the Sogou Chinese classification corpus. Our results show that the combination of word embeddings and the improved tf-idf algorithm outperforms either method used individually.
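The core idea of such a combined representation can be illustrated with a tf-idf-weighted average of word embeddings. The sketch below is a minimal baseline version of this idea, not necessarily the paper's exact "improved" tf-idf variant; the function names, the smoothed-idf formula (borrowed from scikit-learn's convention), and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def doc_vector(tokens: List[str],
               corpus: List[List[str]],
               embeddings: Dict[str, List[float]],
               dim: int) -> List[float]:
    """Represent a document as the tf-idf-weighted average of its
    word embeddings (a common baseline; the paper's 'improved'
    tf-idf weighting may differ)."""
    n_docs = len(corpus)
    # document frequency: in how many documents each word appears
    df = Counter(w for doc in corpus for w in set(doc))
    tf = Counter(tokens)
    vec = [0.0] * dim
    total_weight = 0.0
    for w, count in tf.items():
        if w not in embeddings:
            continue  # skip out-of-vocabulary words
        # smoothed idf keeps all weights strictly positive
        idf = math.log((1 + n_docs) / (1 + df[w])) + 1.0
        weight = (count / len(tokens)) * idf
        for i, x in enumerate(embeddings[w]):
            vec[i] += weight * x
        total_weight += weight
    if total_weight:
        vec = [x / total_weight for x in vec]
    return vec

# Toy example with hypothetical 2-dimensional embeddings.
corpus = [["cat", "dog"], ["cat", "fish"], ["cat", "bird"]]
emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0],
       "fish": [1.0, 1.0], "bird": [0.5, 0.5]}
v = doc_vector(["cat", "dog"], corpus, emb, 2)
```

The resulting document vectors can then be fed to any standard classifier (e.g. an SVM or logistic regression); rarer, more discriminative words receive a larger idf and thus pull the document vector more strongly toward their embeddings.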
