Comparative evaluation of text classification techniques using a large diverse Arabic dataset

A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents, whether publicly or privately accessible, makes automatic text classification necessary. While considerable effort has gone into Western languages, mostly English, little experimentation has been done with Arabic. This paper presents, first, an up-to-date review of work on Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The techniques identified in the literature review are illustrated by applying them to the proposed dataset. Across the feature selection methods, weighting methods, and classification algorithms tested, support vector machines performed best on average, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.
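
To make the shape of such a comparison concrete, the following is a minimal sketch of one possible pipeline: feature selection followed by a classifier. It assumes chi-square as the selection criterion and multinomial Naïve Bayes with Laplace smoothing as the classifier; the abstract does not say which selection or weighting methods were used, so this should not be read as the paper's setup. The six-document corpus is invented (transliterated tokens, whitespace tokenization), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; real Arabic text would need
    # normalization and possibly stemming, which are omitted here.
    return text.split()


def chi_square_scores(docs, labels):
    """Chi-square score per term (document presence), maximized over classes."""
    n = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for c in set(labels):
            pairs = list(zip(doc_sets, labels))
            a = sum(1 for s, y in pairs if term in s and y == c)
            b = sum(1 for s, y in pairs if term in s and y != c)
            cc = sum(1 for s, y in pairs if term not in s and y == c)
            d = n - a - b - cc
            denom = (a + cc) * (b + d) * (a + b) * (cc + d)
            if denom:
                best = max(best, n * (a * d - b * cc) ** 2 / denom)
        scores[term] = best
    return scores


def select_features(scores, k):
    # Keep the k highest-scoring terms; ties are broken alphabetically.
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])


def train_nb(docs, labels, features):
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(t for t in tokenize(d) if t in features)
    return class_counts, term_counts, features


def predict_nb(model, text):
    class_counts, term_counts, features = model
    total = sum(class_counts.values())
    tokens = [t for t in tokenize(text) if t in features]
    best_class, best_lp = None, -math.inf
    for c in sorted(class_counts):
        lp = math.log(class_counts[c] / total)  # log prior
        denom = sum(term_counts[c].values()) + len(features)
        for t in tokens:
            lp += math.log((term_counts[c][t] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best_class, best_lp = c, lp
    return best_class


# Invented toy corpus, for illustration only (not the paper's dataset).
docs = [
    "salah zakat sawm", "hajj salah zakat", "sawm hajj salah",
    "qasida shair bayt", "qafiya qasida shair", "bayt qafiya qasida",
]
labels = ["islamic"] * 3 + ["poems"] * 3

scores = chi_square_scores(docs, labels)
features = select_features(scores, k=2)
model = train_nb(docs, labels, features)
print(predict_nb(model, "salah hajj"), predict_nb(model, "shair qasida"))
```

A full comparison along the lines the abstract describes would repeat this with other classifiers (for example a support vector machine or a decision tree) and other weighting schemes, and would report accuracy on held-out documents rather than on a handful of hand-written strings.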
