An Intelligent System For Arabic Text Categorization

Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system. Many algorithms for stemming and feature selection are tried. Moreover, the document is represented using several term weighting schemes and finally the k-nearest neighbor and Rocchio classifiers are used for classification process. Experiments are performed over self collected data corpus and the results show that the suggested hybrid method of statistical and light stemmers is the most suitable stemming algorithm for Arabic language. The results also show that a hybrid approach of document frequency and information gain is the preferable feature selection criterion and normalized-tfidf is the best weighting scheme. Finally, Rocchio classifier has the advantage over k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is an efficient method and gives generalization accuracy of about 98%.

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  Riyad Al-Shalabi,et al.  A Computational Morphology System for Arabic , 1998, SEMITIC@COLING.

[3]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[4]  Yiming Yang,et al.  High-performing feature selection for text classification , 2002, CIKM '02.

[5]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[6]  Lisa Ballesteros,et al.  Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis , 2002, SIGIR '02.

[7]  Wei-Ying Ma,et al.  An Evaluation on Feature Selection for Text Clustering , 2003, ICML.

[8]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[9]  Kjersti Aas,et al.  Text Categorisation: A Survey , 1999 .

[10]  Fabrizio Sebastiani,et al.  A Tutorial on Automated Text Categorisation , 2000 .

[11]  Ciya Liao,et al.  Feature Preparation in Text Categorization , 2003 .

[12]  Fredric C. Gey,et al.  Building an Arabic Stemmer for Information Retrieval , 2002, TREC.

[13]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[14]  Qasem A. Al-Radaideh,et al.  Using N-grams for Arabic text searching , 2004, J. Assoc. Inf. Sci. Technol..

[15]  Leah S. Larkey,et al.  Arabic Information Retrieval at UMass in TREC-10 , 2001, TREC.

[16]  Yoram Singer,et al.  Context-sensitive learning methods for text categorization , 1996, SIGIR '96.

[17]  Ophir Frieder,et al.  On arabic search: improving the retrieval effectiveness via a light stemming approach , 2002, CIKM '02.