A Comparison of Collocation-Based Similarity Measures in Query Expansion

In this paper, we compare collocation-based similarity measures (Jaccard, Dice, and Cosine) for selecting additional search terms in query expansion. We also consider two further measures: average conditional probability (ACP) and normalized mutual information (NMI). ACP is the mean of the two conditional probabilities between a query term and an additional search term. NMI is a normalized value of the two terms' mutual information. All five measures are functions of the two terms' individual frequencies and their collocation frequency, but they differ in how they combine these quantities. The choice of measure changes the order of the additional search terms and their weights, and therefore has a strong influence on retrieval performance. In our query expansion experiments with these five measures, the additional search terms selected by Jaccard, Dice, and Cosine include more frequent terms with lower similarity values than those selected by ACP or NMI. Overall, the Jaccard, Dice, and Cosine measures are better than ACP and NMI in terms of retrieval effectiveness, whereas NMI and ACP are better in terms of execution efficiency.
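To make the dependence on the three frequencies concrete, the following is a minimal Python sketch of the measures, not the authors' implementation. The Jaccard, Dice, and Cosine forms are the standard frequency-based definitions, and ACP follows the description above. The abstract does not state how NMI is normalized, so `nmi_assumed` uses one possible normalization (pointwise mutual information divided by the negative log of the joint probability) purely as an assumption. All function names, the corpus size `n`, and the helper `rank_candidates` are hypothetical.

```python
import math

# fx, fy: frequencies of the query term and the candidate term.
# fxy: their collocation (co-occurrence) frequency.

def jaccard(fx, fy, fxy):
    return fxy / (fx + fy - fxy)

def dice(fx, fy, fxy):
    return 2 * fxy / (fx + fy)

def cosine(fx, fy, fxy):
    return fxy / math.sqrt(fx * fy)

def acp(fx, fy, fxy):
    # Mean of the two conditional probabilities P(y|x) and P(x|y).
    return 0.5 * (fxy / fx + fxy / fy)

def nmi_assumed(fx, fy, fxy, n):
    # ASSUMPTION: the normalization is not given in the abstract.
    # This variant divides pointwise mutual information by -log P(x, y);
    # n is the (hypothetical) total number of observation units.
    if fxy == 0:
        return -1.0
    if fxy == n:
        return 0.0  # degenerate case: both terms occur everywhere
    pmi = math.log(n * fxy / (fx * fy))
    return pmi / -math.log(fxy / n)

def rank_candidates(fx, candidates, measure):
    # candidates maps term -> (fy, fxy); returns (term, score) pairs,
    # highest score first.
    scored = [(t, measure(fx, fy, fxy)) for t, (fy, fxy) in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Illustrative frequencies only, not data from the paper.
    candidates = {"retrieval": (20, 5), "expansion": (6, 5)}
    for m in (jaccard, dice, cosine, acp):
        print(m.__name__, rank_candidates(10, candidates, m))
```

With these definitions, Dice, Cosine, and ACP are respectively the harmonic, geometric, and arithmetic means of the two conditional probabilities, so for any term pair Jaccard ≤ Dice ≤ Cosine ≤ ACP. The measures therefore assign different weights to the same candidate even when they agree on its rank.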
