Arabic Light Stemming: A Comparative Study between P-Stemmer, Khoja Stemmer, and Light10 Stemmer

Arabic is a derivational language with a deep structure and rich word meanings, and its heavy dependence on morphology is one of the main challenges in processing it. Arabic Natural Language Processing (ANLP) tools are required for many tasks, such as machine learning. For text classification, ANLP supplies the preprocessing steps, which include, but are not limited to, stemming, normalization, and stop-word removal. In this work, we collected 2,000 news articles from Arabic online newspapers and classified them using Support Vector Machine (SVM) and Naive Bayes (NB) classifiers. The classification task was conducted to compare three Arabic light stemmers: P-Stemmer, Khoja Stemmer, and Light10 Stemmer. P-Stemmer outperformed the other two stemmers with both classifiers, reaching an F1-measure of 0.92 with SVM and 0.90 with NB.
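
The preprocessing steps named above (normalization, stop-word removal, and light stemming) can be illustrated with a minimal Python sketch. It is not the P-Stemmer, Khoja, or Light10 implementation: the affix lists, the stop-word set, the minimum stem length, and the function names `normalize`, `light_stem`, and `preprocess` are all assumptions chosen for illustration.

```python
import re

# Illustrative resources only: none of these lists come from the paper.
# Short vowels, tanween, shadda, sukun (U+064B-U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Stop words are stored in their normalized form (see normalize below).
STOP_WORDS = {"في", "من", "علي", "الي", "عن"}
# Longer prefixes are listed first so they are tried before shorter ones.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")


def normalize(text):
    """Remove diacritics and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)   # alef variants -> bare alef
    text = text.replace("ى", "ي")       # alef maqsura -> ya
    text = text.replace("ة", "ه")       # ta marbuta -> ha
    return text


def light_stem(word, strip_suffixes=True, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    if strip_suffixes:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                break
    return word


def preprocess(text):
    """Normalize, tokenize on whitespace, drop stop words, then stem."""
    tokens = normalize(text).split()
    return [light_stem(token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("ذهب الطلاب إلى المدرسة"))
```

The resulting token lists could then be turned into feature vectors and passed to an SVM or NB classifier. Swapping `light_stem` for another stemmer while keeping everything else fixed is the kind of comparison the study describes.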
