A comparative study of root-based and stem-based approaches for measuring the similarity between Arabic words for Arabic text mining applications

Any Arabic text mining application needs a representation of the semantic information contained in words. More precisely, the aim is to better account for the semantic dependencies between words, as expressed by their co-occurrence frequencies. Many proposals exist for computing similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the Root-based (Stemming) and Stem-based (Light Stemming) approaches, on measuring the similarity between Arabic words with the well-known abstractive model Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient, and Pearson Correlation Coefficient. The results show that, on the one hand, greater variety in the corpus produces more accurate results; on the other hand, the Stem-based approach outperforms the Root-based one because the latter alters word meanings.
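The four measures named in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes each word has already been mapped to a dense vector (for example, a row of a truncated-SVD term matrix produced by LSA), and it assumes that "Jaccard Coefficient" refers to the extended (Tanimoto) form commonly used for real-valued vectors. The function names and the two example vectors are hypothetical.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; 0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    return _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))


def extended_jaccard(a, b):
    """Extended Jaccard (Tanimoto) coefficient for real-valued vectors.

    Assumption: this is the variant meant by "Jaccard Coefficient";
    the abstract itself does not say. Assumes a and b are not both zero.
    """
    dot = _dot(a, b)
    return dot / (_dot(a, a) + _dot(b, b) - dot)


def pearson_correlation(a, b):
    """Pearson correlation (assumes neither vector is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    return _dot(da, db) / math.sqrt(_dot(da, da) * _dot(db, db))


# Hypothetical latent-space vectors for two words (made-up values).
word_a = [1.0, 2.0, 3.0]
word_b = [2.0, 4.0, 6.0]

for measure in (euclidean_distance, cosine_similarity,
                extended_jaccard, pearson_correlation):
    print(measure.__name__, measure(word_a, word_b))
```

Note that the Euclidean distance is a dissimilarity (smaller means closer), whereas the other three are similarities (larger means closer), so their values are not directly comparable without a conversion.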
