Evaluating the Performance of Similarity Measures Used in Document Clustering and Information Retrieval

This paper presents the results of an experimental study of several similarity measures used in both information retrieval and document clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures, and that it performs particularly well on text documents. Previous work compared these measures on conventional text datasets; the proposed system instead collects its datasets through an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures on these web document datasets.
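To make the three measures named above concrete, the following is a minimal sketch of how they might be computed on raw term-frequency vectors. It is not the paper's implementation: the whitespace tokenizer, the absence of stemming and term weighting, the set-based form of the Jaccard measure, and the function names `term_vector`, `cosine_similarity`, `jaccard_similarity`, and `euclidean_distance` are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Set-based Jaccard: shared distinct terms over all distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(terms_a & terms_b) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between term-frequency vectors (smaller is more similar)."""
    # Counter returns 0 for terms missing from one of the documents.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


doc1 = term_vector("web document clustering")
doc2 = term_vector("web document retrieval")

print(cosine_similarity(doc1, doc2))
print(jaccard_similarity(doc1, doc2))
print(euclidean_distance(doc1, doc2))
```

Note that the Euclidean measure is a distance, not a similarity, so it has to be inverted or negated before it can be ranked alongside the other two. Cosine similarity ignores vector length, which is one commonly cited reason it suits text documents of differing sizes.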
