Distributed keyword vector representation for document categorization

In the age of information explosion, efficiently categorizing the topic of a document can assist our organization and comprehension of the vast amount of text. In this paper, we propose a novel approach, named DKV, for document categorization using distributed real-valued vector representation of keywords learned from neural networks. Such a representation can project rich context information (or embedding) into the vector space, and subsequently be used to infer similarity measures among words, sentences, and even documents. Using a Chinese news corpus containing over 100,000 articles and five topics, we provide a comprehensive performance evaluation to demonstrate that by exploiting the keyword embeddings, DKV paired with support vector machines can effectively categorize a document into the predefined topics. Results demonstrate that our method can achieve the best performances compared to several other approaches.

[1]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[2]  Yung-Chun Chang,et al.  A semantic frame-based intelligent agent for topic detection , 2017, Soft Comput..

[3]  Jerome Rene Bellegarda,et al.  Latent Semantic Mapping , 2007 .

[4]  Ting Wang,et al.  Topic Tracking with Dynamic Topic Model and Topic-based Weighting Method , 2010, J. Softw..

[5]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[6]  Koray Kavukcuoglu,et al.  Learning word embeddings efficiently with noise-contrastive estimation , 2013, NIPS.

[7]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[8]  Yong Yu,et al.  Learning Word Representation Considering Proximity and Ambiguity , 2014, AAAI.

[9]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[10]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[11]  Siu Cheung Hui,et al.  Automatic fuzzy ontology generation for semantic Web , 2006, IEEE Transactions on Knowledge and Data Engineering.

[12]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[13]  Rémi Gilleron,et al.  Learning Multi-label Alternating Decision Trees from Texts and Data , 2003, MLDM.

[14]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[15]  Yu Tsao,et al.  Semantic Naïve Bayes Classifier for Document Classification , 2013, IJCNLP.

[16]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[17]  David E. Millard,et al.  Automatic Ontology-Based Knowledge Extraction from Web Documents , 2003, IEEE Intell. Syst..

[18]  Mitsuru Ishizuka,et al.  Topic extraction from news archive using TF*PDF algorithm , 2002, Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002..

[19]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[20]  Gerard Salton,et al.  Research and Development in Information Retrieval , 1982, Lecture Notes in Computer Science.

[21]  Ramesh Nallapati,et al.  Event threading within news topics , 2004, CIKM '04.

[22]  J.R. Bellegarda,et al.  Latent semantic mapping [information retrieval] , 2005, IEEE Signal Processing Magazine.

[23]  Xiaolong Wang,et al.  On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering , 2010, J. Comput..

[24]  Alexander Genkin,et al.  Sparse Logistic Regression for Text Categorization , 2005 .

[25]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[26]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[27]  Yoshua Bengio,et al.  Hierarchical Probabilistic Neural Network Language Model , 2005, AISTATS.

[28]  Chang-Shing Lee,et al.  Ontological recommendation multi-agent for Tainan City travel , 2009, Expert Syst. Appl..

[29]  Jesualdo Tomás Fernández-Breis,et al.  An ontology-based intelligent system for recruitment , 2006, Expert Syst. Appl..

[30]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[31]  Xiaotie Deng,et al.  Automatic construction of Chinese stop word list , 2006 .

[32]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[33]  Maria P. Grineva,et al.  Extracting key terms from noisy and multitheme documents , 2009, WWW '09.

[34]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[35]  Jong-Hyeok Lee,et al.  Text categorization based on k-nearest neighbor approach for Web site classification , 2003, Inf. Process. Manag..