Paper information - Distributed keyword vector representation for document categorization

In the age of information explosion, efficiently categorizing the topic of a document helps us organize and comprehend the vast amount of available text. In this paper, we propose a novel approach, named DKV, for document categorization using distributed real-valued vector representations of keywords learned by neural networks. Such a representation projects rich context information (an embedding) into a vector space and can subsequently be used to infer similarity measures among words, sentences, and even documents. Using a Chinese news corpus containing over 100,000 articles across five topics, we provide a comprehensive performance evaluation demonstrating that, by exploiting keyword embeddings, DKV paired with support vector machines can effectively categorize a document into the predefined topics. The results show that our method outperforms several other approaches.
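The idea that keyword embeddings can yield document-level similarity can be illustrated with a minimal sketch. The abstract does not state how DKV combines keyword vectors, so averaging them is an assumption made here for illustration, and the tiny `EMBEDDINGS` table, the keywords, and the function names are all hypothetical. The support vector machine step is omitted; only the similarity computation is shown.

```python
import math

# Hypothetical 3-dimensional keyword embeddings. Real embeddings would be
# learned by a neural model and have many more dimensions.
EMBEDDINGS = {
    "election": [0.9, 0.1, 0.0],
    "parliament": [0.8, 0.2, 0.1],
    "goal": [0.1, 0.9, 0.0],
    "striker": [0.0, 0.8, 0.2],
}


def document_vector(keywords):
    """Average the embeddings of the keywords that have a known vector.

    Averaging is an assumption for this sketch, not the paper's stated method.
    """
    vectors = [EMBEDDINGS[k] for k in keywords if k in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known keywords in document")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy documents, each reduced to a list of keywords.
politics_doc = document_vector(["election", "parliament"])
sports_doc = document_vector(["goal", "striker"])
query_doc = document_vector(["parliament", "unknown-word"])  # unknown word is skipped
```

Comparing `cosine_similarity(query_doc, politics_doc)` with `cosine_similarity(query_doc, sports_doc)` shows the query sitting closer to the politics document. In a pipeline like the one the abstract describes, such document vectors would instead be passed as features to a classifier.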
