Comparison of Jaccard, Dice, Cosine Similarity Coefficient To Find Best Fitness Value for Web Retrieved Documents Using Genetic Algorithm

A similarity coefficient is a function that computes the degree of similarity between a pair of text objects: two documents, two queries, or one document and one query. It can also be used to rank retrieved documents in order of presumed importance. A large number of similarity coefficients have been proposed in the literature, because no single best similarity measure exists (yet). In this paper we present a comparative analysis of three similarity coefficients, namely the Jaccard, Dice and Cosine coefficients, for finding the most relevant document for a given set of keywords. We perform this analysis using a genetic algorithm approach. Because of the randomized nature of the genetic algorithm, the best fitness value is taken as the average over 10 runs of the same code for a fixed number of iterations. The similarity coefficients for a set of documents retrieved from Google for a given query are computed, and the average relevancy in terms of fitness values is then calculated. For each query we averaged 10 different generations by running the program 10 times with a fixed probability of crossover Pc = 0.7 and probability of mutation Pm = 0.01. The same experiment was conducted for 10 queries.
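For illustration, the three coefficients can be sketched in their commonly used vector forms. This is a minimal sketch, not necessarily the exact formulation used in the experiments: it assumes that each document and query is represented as an equal-length vector of term weights (binary keyword presence in the example), and the function names and sample vectors are hypothetical.

```python
import math


def _dot(x, y):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))


def jaccard(x, y):
    """Jaccard coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - dot
    return dot / denom if denom else 0.0


def dice(x, y):
    """Dice coefficient: 2 * x.y / (|x|^2 + |y|^2)."""
    denom = _dot(x, x) + _dot(y, y)
    return 2 * _dot(x, y) / denom if denom else 0.0


def cosine(x, y):
    """Cosine coefficient: x.y / (|x| * |y|)."""
    denom = math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y))
    return _dot(x, y) / denom if denom else 0.0


# Hypothetical example: binary presence of four keywords (assumed data).
document = [1, 1, 0, 1]
query = [1, 0, 0, 1]

print(round(jaccard(document, query), 3))  # 0.667
print(round(dice(document, query), 3))     # 0.8
print(round(cosine(document, query), 3))   # 0.816
```

For the same pair of vectors the three coefficients give different values, which is why the fitness values obtained with each coefficient are compared separately.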