Malay document clustering using complete linkage clustering technique with Cosine Coefficient

Finding useful and relevant information is a challenging task for users. A retrieval system usually responds with a long list of documents, many of which are not relevant to the user's need. Document clustering is a technique that groups documents so that documents in the same cluster are similar to each other and documents in different clusters are dissimilar. This paper focuses on document clustering for a Malay test collection consisting of 2028 Malay-translated Hadith documents from the book Sahih Bukhari. It presents the results of applying the Complete Linkage Clustering algorithm with the Cosine Coefficient to these documents. The experiments are evaluated using the Recall (R), Precision (P) and Effectiveness (E) measures, and are conducted with 100 clusters, 50 clusters and 20 clusters. The results show that the smaller the number of clusters, the higher the Recall (R) but the lower the Precision (P). Comparing the Effectiveness (E) measure against that of the non-clustered documents shows that applying a clustering algorithm improves the effectiveness of the search process. In this experiment, 20 clusters is more effective than the other two settings.
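As an illustration of the technique named above, the following is a minimal Python sketch of complete linkage clustering with the cosine coefficient, together with the three evaluation measures. It is not the implementation used in the experiments. The four toy documents, the function names (`cosine`, `complete_linkage`, `evaluate`) and the whitespace tokenisation are assumptions made for the example. The abstract does not define the E measure, so the sketch assumes van Rijsbergen's form, E = 1 - (1 + b^2)PR / (b^2 P + R) with b = 1, where a lower E indicates better retrieval.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine coefficient between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete_linkage(vectors, k):
    """Agglomerative clustering that stops when k clusters remain.

    Under complete linkage, the similarity of two clusters is the
    similarity of their LEAST similar pair of members.
    """
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(k, 1):
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = min(sim[p][q] for p in clusters[i] for q in clusters[j])
                if link > best:
                    best, pair = link, (i, j)
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def evaluate(retrieved, relevant, beta=1.0):
    """Return (precision, recall, E) for one query.

    E is ASSUMED to be van Rijsbergen's measure; lower is better.
    """
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, 1.0 - f


# Hypothetical toy documents, not taken from the Hadith collection.
docs = [
    "solat wajib lima waktu",
    "solat sunat malam",
    "zakat harta wajib",
    "zakat fitrah ramadan",
]
vectors = [Counter(d.split()) for d in docs]
clusters = complete_linkage(vectors, k=2)
print(clusters)  # [[0, 1], [2, 3]]

# Treat the second cluster as the retrieved set for a query whose
# only relevant document is number 2.
print(evaluate(clusters[1], relevant=[2]))
```

In this toy run the retrieved cluster contains the one relevant document plus one non-relevant document, so recall is 1.0 and precision is 0.5. This mirrors the trade-off reported above: larger clusters tend to raise recall at the cost of precision.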