Application of Doc2vec and Stochastic Gradient Descent algorithms for Text Categorization

In text categorization, text representation has become an important factor limiting the classification accuracy of classifiers. Because the Vector Space Model (VSM) ignores the semantics, grammar, and positional information of lexical items, it discards considerable feature information that could be useful for text categorization. Therefore, in this study, we propose a text classification algorithm that combines the Doc2vec and Stochastic Gradient Descent (SGD) algorithms. First, the Doc2vec algorithm is trained on the original corpus to generate paragraph vectors for the text. Then, for each piece of text, the paragraph vectors are concatenated to form its feature vector. Finally, the text is classified using a multinomial Naive Bayes and an SGD classifier. Experimental results on the 20 Newsgroups corpus indicate that the proposed algorithm classifies texts quickly and efficiently, with an accuracy above 90%.
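The classification step described above (fixed-length document vectors fed to a linear classifier trained by stochastic gradient descent) can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the tiny 2-D vectors stand in for Doc2vec paragraph vectors, and the log-loss objective, learning rate, and epoch count are assumed values chosen for the example.

```python
import math
import random

def sgd_train(X, y, lr=0.1, epochs=200, seed=0):
    """Train a logistic-regression classifier with plain SGD.

    X: list of fixed-length feature vectors (stand-ins for Doc2vec
    paragraph vectors); y: binary labels (0 or 1).
    """
    rng = random.Random(seed)
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)  # stochastic: visit one example at a time
        for i in idx:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid prediction
            g = p - y[i]                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b

def sgd_predict(w, b, x):
    """Predict the class of one document vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hypothetical "paragraph vectors" for two well-separated classes.
X = [[-1.0, -1.2], [-0.8, -1.1], [-1.1, -0.9],
     [1.0, 1.2], [0.9, 1.1], [1.1, 0.8]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_train(X, y)
preds = [sgd_predict(w, b, x) for x in X]
```

In practice one would obtain the document vectors from a trained Doc2vec model and could swap in an off-the-shelf SGD classifier; the per-example weight update shown here is the core of that approach.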
