Improved Feature Weight Calculation Methods Based on Part-of-Speech in Text Classification

With the development of information technology and the growing number of electronic documents, text classification, as a means of large-scale text information processing, attracts increasing attention from researchers. To obtain better text classification performance, two methods are proposed that improve feature weight calculation by introducing the influence of part-of-speech: Single-Part-of-Speech (SPOS) and Multi-Part-of-Speech (MPOS). In the improved approaches, the part-of-speech weights are optimized by the Particle Swarm Optimization algorithm. Contrast experiments between the improved feature weight calculation methods and the original TF-IDF method are conducted, with Reuters-21578 used as the corpus to test the applicability of the improved methods. The experimental results demonstrate that the improved feature weighting methods perform better than the original TF-IDF method, achieving higher precision across different feature-space dimensionalities. In addition, the MPOS method works more effectively than the SPOS method. In-depth analysis also shows that both nouns and verbs have some influence on classification, but nouns contribute relatively more.

Keywords: Text Classification; Part-of-Speech; Particle Swarm Optimization; Feature Weight
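
The general idea of scaling TF-IDF by part-of-speech can be illustrated with a minimal sketch. The abstract does not give the SPOS or MPOS formulas, so this code does not reproduce either method: it assumes, purely for illustration, that each term occurrence contributes its part-of-speech weight to the term frequency. The function name `pos_weighted_tfidf`, the tag names, and the weight values are all hypothetical; in the paper the weights are tuned by Particle Swarm Optimization rather than fixed by hand.

```python
import math
from collections import Counter

# Hypothetical fixed weights (assumption); the paper optimizes these with PSO.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.6}
DEFAULT_WEIGHT = 0.3  # assumed weight for any other part of speech


def pos_weighted_tfidf(docs, pos_weights=POS_WEIGHTS, default=DEFAULT_WEIGHT):
    """Return one {term: weight} dict per document.

    docs: list of documents, each a list of (term, pos_tag) pairs.
    Assumed scheme: weight = (sum of POS weights over occurrences) * idf.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})

    vectors = []
    for doc in docs:
        # POS-weighted term frequency: each occurrence adds its tag's weight.
        weighted_tf = Counter()
        for term, pos in doc:
            weighted_tf[term] += pos_weights.get(pos, default)

        # Multiply by the plain inverse document frequency log(N / df).
        vectors.append(
            {term: tf * math.log(n_docs / df[term]) for term, tf in weighted_tf.items()}
        )
    return vectors


if __name__ == "__main__":
    sample = [
        [("market", "NOUN"), ("rise", "VERB")],
        [("market", "NOUN"), ("oil", "NOUN"), ("oil", "NOUN")],
    ]
    for vec in pos_weighted_tfidf(sample):
        print(vec)
```

Under this assumed scheme, a term that appears in every document still receives zero weight (its IDF is zero), while a noun occurrence counts more than a verb occurrence for the same IDF, which is consistent in spirit with the reported finding that nouns contribute more to classification.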
