Arabic Stemming Techniques as Feature Extraction Applied in Arabic Text Classification

In this paper, we conduct a comparative study of the impact of stemming algorithms, used as feature extraction systems, on the classification of Arabic text documents. Root-based stemming is aggressive: it reduces words to their three-letter roots, which may affect semantics, since words with different meanings may share the same root. Light stemming, by comparison, removes frequently used prefixes and suffixes from Arabic words. It does not extract the root and thus does not affect the semantics of words. However, the result of light stemming is not necessarily a valid word. For the evaluation, we used a corpus containing 5,070 documents that fall into six classes. Several tests were carried out using two separate representations of the same corpus. The K-Nearest Neighbors (KNN) classifier was used for the classification task, and the recall measure was used to evaluate the performance of these methods.
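To make the contrast concrete, the following is a minimal sketch of what a light stemmer might look like, assuming a small hand-picked set of prefixes and suffixes. The affix lists, the `light_stem` name, and the `min_len` threshold are illustrative assumptions, not the lists or procedure used in this paper. The sketch strips at most one prefix and one suffix and never attempts to recover the root.

```python
# Illustrative affix lists (assumptions for this sketch, not taken from the paper).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one common prefix and one common suffix from `word`.

    An affix is removed only if at least `min_len` letters remain, so short
    words are left untouched. No root extraction is attempted.
    """
    # Try longer prefixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break

    # Same idea for suffixes: longest match first, at most one removal.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break

    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المعلمون", "كتب"]:
        print(w, "->", light_stem(w))
```

A root-based stemmer would go further and map a word such as كتاب to the three-letter root كتب, which it shares with words of different meaning. The sketch above stops at affix removal, so its output may not be a valid word.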
