An Efficient Method for Document Categorization Based on Word2vec and Latent Semantic Analysis

Document categorization is the task of automatically assigning documents from a mixed collection to classes, and the main difficulty is representing document content in vector space without losing information. This paper proposes a new model, named LSA + word2vec, that combines Latent Semantic Analysis (LSA) with word2vec to categorize documents. This is the first attempt to combine word2vec with LSA for document categorization, and it maps each document to vector space while preserving the document's content. First, we build a term-by-document matrix whose elements are determined by Term Frequency-Inverse Document Frequency (TF-IDF) weighting and by word vectors trained with word2vec. Because each element is itself a vector, the matrix is 3-dimensional, and it describes both the meaning of every word and the content of every document. Second, Singular Value Decomposition (SVD) is applied to the matrix, which lowers the computational complexity. Then, the document vectors obtained from the new model are used to train a Convolutional Neural Network (CNN), an efficient deep learning algorithm that greatly improves classification accuracy. We evaluate performance on the 20newsgroups corpus. The results show that the new model performs better on document categorization tasks, with accuracy about 15% higher than that of traditional methods such as LSA and the Vector Space Model (VSM).
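The construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the random vectors standing in for trained word2vec embeddings, the particular TF-IDF variant, and the choice to unfold the 3-dimensional matrix into one row per document before the SVD are all assumptions made for illustration. The function names `tfidf_matrix` and `lsa_word2vec_doc_vectors` are hypothetical, and the CNN stage is omitted.

```python
import numpy as np


def tfidf_matrix(docs, vocab):
    """Return a (terms x docs) TF-IDF matrix for tokenized documents."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc:
            counts[index[token], j] += 1
    tf = counts / counts.sum(axis=0, keepdims=True)   # term frequency per document
    df = (counts > 0).sum(axis=1)                     # document frequency per term
    idf = np.log(len(docs) / df) + 1.0                # assumed smoothing variant
    return tf * idf[:, None]


def lsa_word2vec_doc_vectors(docs, vocab, word_vectors, k):
    """Build the (terms x docs x dim) matrix and reduce it with truncated SVD."""
    weights = tfidf_matrix(docs, vocab)                        # (terms, docs)
    emb = np.stack([word_vectors[t] for t in vocab])           # (terms, dim)
    # Each element is the term's word vector scaled by its TF-IDF weight.
    tensor = weights[:, :, None] * emb[:, None, :]             # (terms, docs, dim)
    # Assumption: unfold to one row per document so a standard SVD applies.
    unfolded = tensor.transpose(1, 0, 2).reshape(len(docs), -1)
    u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
    return tensor, u[:, :k] * s[:k]                            # k-dim document vectors


# Toy corpus (hypothetical); the paper uses 20newsgroups.
docs = [
    ["space", "rocket", "launch"],
    ["rocket", "engine", "space"],
    ["hockey", "goal", "team"],
    ["team", "goal", "score"],
]
vocab = sorted({token for doc in docs for token in doc})

# Placeholder embeddings: random vectors instead of trained word2vec vectors.
rng = np.random.default_rng(0)
word_vectors = {term: rng.normal(size=5) for term in vocab}

tensor, doc_vecs = lsa_word2vec_doc_vectors(docs, vocab, word_vectors, k=2)
print(tensor.shape, doc_vecs.shape)
```

In a fuller pipeline, the rows of `doc_vecs` would be the inputs to the classifier; here they only show how the TF-IDF weights and word vectors combine before dimensionality reduction.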
