A Comparative Study of Statistical Feature Reduction Methods for Arabic Text Categorization

Feature reduction methods have been successfully applied to text categorization. In this paper, we perform a comparative study on three feature reduction methods for text categorization, including Document Frequency (DF), Term Frequency Inverse Document Frequency (TFIDF) and Latent Semantic Analyses (LSA). Our feature set is relatively large (since there are thousands of different terms in different texts files). We propose the use of the previous feature reduction methods as a preprocessor of Back-Propagation Neural Network (BPNN) to reduce the input data on training process. The experimental results on an Arabic data set demonstrate that among the three dimensionality reduction techniques proposed, TFIDF was found to be the most effective in reducing the dimensionality of the feature space.

[1]  Natheer Khasawneh,et al.  Feature reduction techniques for Arabic text categorization , 2009 .

[2]  Fouzi Harrag,et al.  Neural Network for Arabic text classification , 2009, 2009 Second International Conference on the Applications of Digital Information and Web Technologies.

[3]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[4]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[5]  Mohamed S. Abdel-Wahab,et al.  An Intelligent System For Arabic Text Categorization , 2006 .

[6]  Abdelwadood Mesleh,et al.  Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System , 2007 .

[7]  Dik Lun Lee,et al.  Feature reduction for neural network based text categorization , 1999, Proceedings. 6th International Conference on Advanced Systems for Advanced Applications.

[8]  Lisa Ballesteros,et al.  Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis , 2002, SIGIR '02.

[9]  Bo Yu,et al.  Latent semantic analysis for text categorization using neural network , 2008, Knowl. Based Syst..

[10]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[11]  Stefan Wermter,et al.  Neural Network Agents for Learning Semantic Text Classification , 2000, Information Retrieval.

[12]  Ghassan Kanaan,et al.  Text Feature Selection using Particle Swarm Optimization Algorithm , 2009 .

[13]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .