Effect of stemming on text similarity for Arabic language at sentence level

Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. A total of 10 algorithms are used, covering Arabic light stemmers, heavy stemmers, and lemmatizers. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in the experiments achieves better Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, produce worse results than the original text.
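To make the evaluation idea concrete, the following is a minimal Python sketch of how stemming can change a lexical similarity score and how system scores can be compared to gold scores with Pearson correlation. It is an illustration under stated assumptions, not the study's pipeline: `toy_light_stem` is a hypothetical affix stripper with a tiny hand-picked affix list (it is not ARLSTem, Farasa, or any other stemmer named above), Jaccard overlap stands in for the unspecified similarity measures, and the example sentences are invented.

```python
import math

# Hypothetical, deliberately tiny affix lists for illustration only.
# Real Arabic light stemmers use larger, carefully ordered rule sets.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def toy_light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def jaccard(sentence_a, sentence_b, stem=False):
    """Jaccard overlap of whitespace tokens, optionally after toy stemming."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    if stem:
        tokens_a = [toy_light_stem(t) for t in tokens_a]
        tokens_b = [toy_light_stem(t) for t in tokens_b]
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented sentence pair: same words with and without the definite article.
    a = "الطالب يكتب الدرس"
    b = "طالب يكتب درس"
    print(jaccard(a, b))             # 0.2
    print(jaccard(a, b, stem=True))  # 1.0
```

In this toy pair, removing the definite article lets the surface forms match, so the overlap rises from 0.2 to 1.0. An overly aggressive stemmer can also merge unrelated words, which is one plausible reason some stemmers lower the correlation with gold scores.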
