Selecting Effective Expansion Terms for Better Information Retrieval

Automatic query expansion is a well-known method for improving the performance of information retrieval systems. In this paper, we consider methods for extracting candidate terms for automatic query expansion based on co-occurrence information from pseudo-relevant documents. The paper has two objectives: to present different ways of selecting and ranking co-occurring terms, and to suggest the use of information-theoretic measures for ranking the selected co-occurring terms in order to improve retrieval efficiency. Specifically, we use two information-theoretic measures: Kullback-Leibler divergence (KLD) and a variant of KLD. Both measures are based on the relative entropy between the top-ranked documents and the entire collection. We compare the retrieval improvement achieved by expanding the query with terms obtained by different methods from both approaches (co-occurrence based and information theoretic). Experiments were performed on the TREC-1 data set. Extensive experiments were carried out to select suitable values for the parameters of automatic query expansion, such as the number n of top-ranked documents and the number of terms selected for expansion. The results suggest, first, that considerable improvements can be achieved if co-occurring terms are selected properly, taking into account the different options available for selecting them. Second, information-theoretic measures applied to co-occurring terms can help improve retrieval efficiency.
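
The KLD-based ranking mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code assumes the commonly used form in which a term t is scored by its contribution to the divergence, p_top(t) * log(p_top(t) / p_coll(t)), where p_top and p_coll are the term's relative frequencies in the top-ranked documents and in the whole collection. The function names, the tokenized-list input format, and the toy data are hypothetical and are not taken from the paper; the KLD variant used by the authors is not shown.

```python
import math
from collections import Counter


def kld_term_scores(top_docs, collection):
    """Score each term of the top documents by its KLD contribution.

    Assumed form (not confirmed by the abstract):
        score(t) = p_top(t) * log(p_top(t) / p_coll(t))
    Both arguments are lists of documents; each document is a list of tokens.
    """
    top_counts = Counter(term for doc in top_docs for term in doc)
    coll_counts = Counter(term for doc in collection for term in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())

    scores = {}
    for term, count in top_counts.items():
        p_top = count / top_total
        p_coll = coll_counts[term] / coll_total
        if p_coll == 0:
            # Term missing from the collection statistics: skip it.
            continue
        scores[term] = p_top * math.log(p_top / p_coll)
    return scores


def select_expansion_terms(query_terms, top_docs, collection, k):
    """Return the k highest-scoring terms that are not already in the query."""
    scores = kld_term_scores(top_docs, collection)
    candidates = [(t, s) for t, s in scores.items() if t not in query_terms]
    # Sort by descending score; break ties alphabetically for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in candidates[:k]]


# Toy collection (hypothetical data): the first two documents play the
# role of the pseudo-relevant, top-ranked documents.
collection = [
    ["query", "expansion", "retrieval"],
    ["query", "expansion", "feedback"],
    ["cooking", "recipes", "pasta"],
    ["football", "match", "score"],
]
top_docs = collection[:2]
expansion = select_expansion_terms({"query"}, top_docs, collection, k=2)
print(expansion)
```

In this sketch the two parameters highlighted in the abstract appear as the size of `top_docs` (the number of top-ranked documents) and `k` (the number of expansion terms); a real system would normally also apply stemming and stop-word removal before counting.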
