New stemming for Arabic text classification using feature selection and decision trees

In this paper we compare two stemming algorithms for Arabic text classification (categorization): the Khoja stemmer and our new stemmer. Chi-square statistics are used for feature selection, and the study focuses on the decision tree classifier. The evaluation uses a corpus of 5070 documents independently classified into six categories (sport, entertainment, business, middle east, switch and world), with experiments run in the WEKA toolkit. Recall is the measure used to compare the two methods. The results show that text classification using our new stemmer outperforms classification using the Khoja stemmer.

Keywords: Arabic text classification; Stemming; Decision tree; Chi-square
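As an illustration of the two quantities named in the abstract, the following is a minimal sketch of the chi-square score commonly used for term selection in text categorization, together with per-category recall. It is not the authors' implementation: the function names `chi_square` and `recall` and the count variables `a`, `b`, `c`, `d` are assumptions introduced here, and the formula is the standard two-by-two contingency form, which the abstract does not spell out.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one category (standard 2x2 form, assumed).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat the term as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def recall(true_positives, false_negatives):
    """Fraction of a category's documents that the classifier retrieved."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


if __name__ == "__main__":
    # A term that appears in every document of the category and nowhere else.
    print(chi_square(10, 0, 0, 10))
    # A term distributed identically inside and outside the category.
    print(chi_square(5, 5, 5, 5))
    print(recall(8, 2))
```

Under this sketch, terms would be ranked by score per category and only the highest-scoring ones kept as features before training the decision tree; the cut-off is a tuning choice not given in the abstract.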
