A novel term weighting scheme based on discrimination power obtained from past retrieval results

Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on the hypothesis that a term's role in accumulated past retrieval sessions affects its general importance. The method exploits the availability of past retrieval results: the queries that contain a particular term, the documents retrieved for them, and their relevance judgments. A term's evidential weight, as we propose in this paper, depends on the degree to which the mean frequency values of the term differ between the relevant and non-relevant document distributions observed in the past. More precisely, it takes into account the rankings and similarity values of the relevant and non-relevant documents. Our experimental results on standard test collections show that the proposed term weighting scheme improves on conventional TF*IDF and language-model-based schemes. This indicates that evidential term weights bring in a new aspect of term importance and complement the collection statistics that TF*IDF is based on. We also show how the proposed term weighting scheme, based on the notion of evidential weights, is related to the well-known weighting schemes based on language modeling and probabilistic models.
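
The core idea, that a term's weight grows with the gap between its mean frequency in past relevant and past non-relevant documents, can be illustrated with a minimal Python sketch. This is not the formula from the paper, which the abstract does not state. The standardized mean difference, the function names `evidential_weight` and `combined_weight`, and the mixing parameter `alpha` are all assumptions made for illustration, and the sketch ignores the rankings and similarity values that the proposed scheme also takes into account.

```python
from statistics import mean, pstdev


def evidential_weight(rel_tfs, nonrel_tfs):
    """Hypothetical evidential weight for one term.

    rel_tfs / nonrel_tfs: frequencies of the term in documents judged
    relevant / non-relevant in past sessions whose queries contained it.
    Assumption: the gap between the two means is scaled by the spread
    of all observed frequencies (a standardized mean difference).
    """
    if not rel_tfs or not nonrel_tfs:
        return 0.0  # no past evidence for one of the two classes
    gap = mean(rel_tfs) - mean(nonrel_tfs)
    spread = pstdev(list(rel_tfs) + list(nonrel_tfs))
    return gap / spread if spread > 0 else 0.0


def combined_weight(tf, idf, evidence, alpha=0.5):
    """Hypothetical combination: boost TF*IDF by positive evidence only."""
    return tf * idf * (1.0 + alpha * max(evidence, 0.0))


# A term that is frequent in past relevant documents and rare in
# non-relevant ones receives a positive evidential weight.
w = evidential_weight([3, 4, 5], [0, 1, 1])
score = combined_weight(tf=2, idf=1.5, evidence=w)
```

Under these assumptions, a term whose frequency does not separate relevant from non-relevant documents gets zero evidence and its weight falls back to plain TF*IDF, which mirrors the abstract's claim that evidential weights complement rather than replace collection statistics.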
