Distance Measures and Stemming Impact on Arabic Document Clustering

Clustering of Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using a similarity or distance measure. However, many similarity and distance measures are available, and their effectiveness in document clustering when combined with stemming that exploits syntactic structure is still unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and determine which measure is most suitable for Arabic document clustering. In our experiments, the method based on syntactic structure and morphology outperformed the other stemming methods under each of the five similarity/distance measures for Arabic document clustering. The best-performing similarity and distance measures are cosine similarity and Euclidean distance, respectively.
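For readers unfamiliar with the five measures named in the abstract, the following is a minimal Python sketch of how each could be computed on two term-weight vectors of equal length. It is an illustration only, not the implementation used in the study: the function names are hypothetical, the Jaccard coefficient is taken in its extended (Tanimoto) form for real-valued vectors, and the averaged Kullback-Leibler divergence follows one common formulation from the document-clustering literature; the abstract does not state which variants the authors used.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_coefficient(a, b):
    # Extended (Tanimoto) Jaccard for real-valued vectors (assumed variant).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def pearson_correlation(a, b):
    # Covariance of the two vectors divided by the product of their spreads.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def euclidean_distance(a, b):
    # Straight-line distance between the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def averaged_kl_divergence(a, b):
    # Symmetric, term-wise averaged KL divergence (assumed formulation).
    # Inputs are expected to be non-negative term weights.
    total = 0.0
    for x, y in zip(a, b):
        if x + y == 0:
            continue  # term absent from both documents
        p1, p2 = x / (x + y), y / (x + y)
        w = p1 * x + p2 * y  # weighted average weight of the term
        if x > 0:
            total += p1 * x * math.log(x / w)
        if y > 0:
            total += p2 * y * math.log(y / w)
    return total
```

Note that cosine, Jaccard, and Pearson are similarities (higher means more alike), whereas Euclidean distance and averaged KL divergence are dissimilarities (lower means more alike), so a clustering routine has to treat the two groups with opposite orientation.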
