Paper information - Stemming and similarity measures for Arabic Documents Clustering

Stemming and similarity measures for Arabic Documents Clustering

Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially given the rapid growth in the number of online documents written in Arabic. Document clustering aims to automatically group similar documents into the same cluster using a similarity or distance measure. In this paper, we evaluate the impact of stemming on Arabic text document clustering with five similarity/distance measures: Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient, and averaged Kullback-Leibler divergence. Our experiments on the test dataset show that stemming does not yield good clustering results, but it does make the document representation smaller and the clustering faster.
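To make the five measures concrete, here is a minimal Python sketch that computes each one on two term-weight vectors. It uses common textbook definitions, which may differ from the exact formulations used in the paper. Two points are assumptions: the Jaccard coefficient is taken in its extended (Tanimoto) form for real-valued vectors, and the averaged Kullback-Leibler divergence is approximated by the equal-weight average of the two KL divergences to the midpoint distribution (the Jensen-Shannon divergence), since the abstract does not state the weighting. All function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended Jaccard (Tanimoto) coefficient for real-valued vectors.

    Assumption: the extended form is used; assumes the vectors are not both zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


def pearson_correlation(a, b):
    """Pearson correlation of the two weight vectors (assumes neither is constant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def averaged_kl_divergence(a, b):
    """Equal-weight average of KL(p || m) and KL(q || m), with m = (p + q) / 2.

    Assumption: this Jensen-Shannon form stands in for the paper's averaged
    KL divergence. Vectors are first normalized to sum to 1.
    """
    sum_a, sum_b = sum(a), sum(b)
    p = [x / sum_a for x in a]
    q = [y / sum_b for y in b]
    total = 0.0
    for p_i, q_i in zip(p, q):
        m_i = 0.5 * (p_i + q_i)
        if p_i > 0:
            total += 0.5 * p_i * math.log(p_i / m_i)
        if q_i > 0:
            total += 0.5 * q_i * math.log(q_i / m_i)
    return total


if __name__ == "__main__":
    # Hypothetical term-frequency vectors over a shared four-term vocabulary.
    doc_a = [2.0, 1.0, 0.0, 3.0]
    doc_b = [1.0, 0.0, 1.0, 3.0]
    for measure in (euclidean_distance, cosine_similarity, jaccard_coefficient,
                    pearson_correlation, averaged_kl_divergence):
        print(measure.__name__, round(measure(doc_a, doc_b), 4))
```

Euclidean distance and the averaged KL divergence are distances (smaller means more similar), while the other three are similarities (larger means more similar), so a clustering routine has to convert one kind into the other before comparing them on equal terms.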
