A Study of Chinese Document Representation and Classification with Word2vec

Word2vec is a neural network language model that converts words and phrases into high-quality distributed vectors (called word embeddings) that capture semantic relationships between words, so it offers a unique perspective on text classification and other natural language processing (NLP) tasks. In this paper, we propose combining an improved tf-idf algorithm with word embeddings to represent documents, and we conduct text classification experiments on the Sogou Chinese classification corpus. Our results show that the combination of word embeddings and the improved tf-idf algorithm outperforms either method used individually.
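The core idea of such a combined representation can be illustrated with a tf-idf-weighted average of word embeddings. The sketch below is a minimal baseline version of this idea, not necessarily the paper's exact "improved" tf-idf variant; the function names, the smoothed-idf formula (borrowed from scikit-learn's convention), and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def doc_vector(tokens: List[str],
               corpus: List[List[str]],
               embeddings: Dict[str, List[float]],
               dim: int) -> List[float]:
    """Represent a document as the tf-idf-weighted average of its
    word embeddings (a common baseline; the paper's 'improved'
    tf-idf weighting may differ)."""
    n_docs = len(corpus)
    # document frequency: in how many documents each word appears
    df = Counter(w for doc in corpus for w in set(doc))
    tf = Counter(tokens)
    vec = [0.0] * dim
    total_weight = 0.0
    for w, count in tf.items():
        if w not in embeddings:
            continue  # skip out-of-vocabulary words
        # smoothed idf keeps all weights strictly positive
        idf = math.log((1 + n_docs) / (1 + df[w])) + 1.0
        weight = (count / len(tokens)) * idf
        for i, x in enumerate(embeddings[w]):
            vec[i] += weight * x
        total_weight += weight
    if total_weight:
        vec = [x / total_weight for x in vec]
    return vec

# Toy example with hypothetical 2-dimensional embeddings.
corpus = [["cat", "dog"], ["cat", "fish"], ["cat", "bird"]]
emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0],
       "fish": [1.0, 1.0], "bird": [0.5, 0.5]}
v = doc_vector(["cat", "dog"], corpus, emb, 2)
```

The resulting document vectors can then be fed to any standard classifier (e.g. an SVM or logistic regression); rarer, more discriminative words receive a larger idf and thus pull the document vector more strongly toward their embeddings.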
