Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm

This paper deals with automatic classification of Arabic web documents. Such a classification is very useful for affording directory search functionality, which has been used by many web portals and search engines to cope with an ever-increasing number of documents on the web. In this paper, Naive Bayes (NB) which is a statistical machine learning algorithm, is used to classify non-vocalized Arabic web documents (after their words have been transformed to the corresponding canonical form, i.e., roots) to one of five pre-defined categories. Cross validation experiments are used to evaluate the NB categorizer. The data set used during these experiments consists of 300 web documents per category. The results of cross validation in the leave-one-out experiment show that, using 2,000 terms/roots, the categorization accuracy varies from one category to another with an average accuracy over all categories of 68.78 %. Furthermore, the best categorization performance by category during cross validation experiments goes up to 92.8%. Further tests carried out on a manually collected evaluation set which consists of 10 documents from each of the 5 categories, show that the overall classification accuracy achieved over all categories is 62%, and that the best result by category reaches 90%.

[1]  Bojan Cestnik,et al.  Estimating Probabilities: A Crucial Task in Machine Learning , 1990, ECAI.

[2]  Yiming Yang,et al.  High-performing feature selection for text classification , 2002, CIKM '02.

[3]  David L. Waltz,et al.  Trading MIPS and memory for knowledge engineering , 1992, CACM.

[4]  Gerard Salton,et al.  On the Specification of Term Values in Automatic Indexing , 1973 .

[5]  David D. Lewis,et al.  A comparison of two learning algorithms for text categorization , 1994 .

[6]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[7]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[8]  Andreas S. Weigend,et al.  A neural network approach to topic spotting , 1995 .

[9]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[10]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[11]  Kostas Tzeras,et al.  Automatic indexing based on Bayesian inference networks , 1993, SIGIR.

[12]  Amine Bensaid,et al.  Barq: distributed multilingual internet search engine with focus on Arabic language , 2003, SMC'03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483).

[13]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[14]  Riyad Al-Shalabi,et al.  A Computational Morphology System for Arabic , 1998, SEMITIC@COLING.

[15]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[16]  F. Schwartz,et al.  Using Clustering to Boost Text Classification , 2001 .

[17]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[18]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[19]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[20]  Koby Crammer,et al.  A Family of Additive Online Algorithms for Category Ranking , 2003, J. Mach. Learn. Res..