The Effect of Using Light Stemming for Arabic Text Classification

Arabic is one of the ancient Semitic languages and one of the six official languages of the United Nations. Arabic text classification plays an essential role in modern applications. Handling Arabic text differs substantially from handling English text, and preprocessing Arabic text is itself challenging. This paper presents an implementation of a Naïve Bayes classifier for Arabic text with and without a stemmer as a preprocessing step. A set of 800 documents in four categories was taken from the Text Retrieval Conference (TREC) 2001 dataset. The results showed that Naïve Bayes with a light stemmer achieved better accuracy than Naïve Bayes without a stemmer: 35.0745% with light stemming versus 33.831% without it.

Keywords: Arabic language; light stemming; information retrieval; Naïve Bayes classification
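The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the affix lists, the function names (`light_stem`, `train_nb`, `predict_nb`), the add-one smoothing, and the two-document toy corpus are all assumptions chosen for illustration. It shows how light stemming can map different surface forms of a word to a shared stem before a multinomial Naïve Bayes classifier counts them.

```python
import math
from collections import Counter, defaultdict

# Illustrative affix lists (an assumption; the paper's exact stemmer is not given here).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word, min_len=2):
    """Strip at most one prefix, then make one pass over the suffix list,
    removing each matching suffix while at least min_len letters remain."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
    return word

def tokenize(text, stem):
    """Whitespace tokenization, optionally followed by light stemming."""
    tokens = text.split()
    return [light_stem(t) for t in tokens] if stem else tokens

def train_nb(docs, labels, stem=False):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text, stem)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text, stem=False):
    """Return the class with the highest log-posterior (add-one smoothing).
    Tokens outside the training vocabulary are ignored."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    tokens = [t for t in tokenize(text, stem) if t in vocab]
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus: one "sport" and one "economy" document.
train_docs = ["اللاعبون في الملعب", "البنوك والاسواق"]
train_labels = ["sport", "economy"]

stemmed_model = train_nb(train_docs, train_labels, stem=True)
# The query word is a different inflection of a training word; stemming maps both to one stem.
print(predict_nb(stemmed_model, "لاعبين", stem=True))
```

Without stemming, the query word in this toy corpus does not appear in the training vocabulary at all, so the classifier can only fall back on class priors. The accuracy gain reported in the abstract depends on the actual dataset and stemmer, which this sketch does not reproduce.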
