Examining the information retrieval process from an inductive perspective

Term-weighting functions derived from various models of retrieval aim to model human notions of relevance more accurately. However, there is a lack of analysis of the sources of evidence from which important features of these term-weighting schemes originate. In general, features pertaining to these term-weighting schemes can be collected from (1) the document, (2) the entire collection and (3) the query. In this work, we perform an empirical analysis to determine the increase in effectiveness as information from these three sources becomes more accurate. First, we determine the number of documents that must be indexed to estimate collection-wide features accurately enough to obtain near-optimal effectiveness for a range of term-weighting functions. Similarly, we determine the amount of a document and of a query that must be sampled to achieve near-peak effectiveness. This analysis also allows us to determine which source (i.e. the document, the collection or the query) contributes most to the performance of a term-weighting function. We use our framework to construct a new model of weighting in which we discard the 'bag of words' model and aim to retrieve documents based on the initial physical representation of a document, using some basic axioms of retrieval. We show that this is a good first step towards incorporating some more interesting features into a term-weighting function.
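To make the collection-sampling question concrete, the following is a minimal sketch of how the accuracy of one collection-wide feature could be measured as a function of the number of documents indexed. It is not the experimental setup of this work: the choice of inverse document frequency in the form log(N/df), the function names `idf_from_docs` and `mean_abs_idf_error`, and the toy collection are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def idf_from_docs(docs):
    """Return {term: log(N / df)} for a list of whitespace-tokenised documents.

    The log(N / df) form of IDF is an illustrative assumption.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term at most once per document (document frequency).
        df.update(set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def mean_abs_idf_error(docs, sample_size, seed=0):
    """Mean absolute difference between sampled and full-collection IDF.

    Only terms that occur in the sample are compared, since unseen terms
    have no estimate. This error measure is a hypothetical choice.
    """
    rng = random.Random(seed)
    sample = rng.sample(docs, sample_size)
    full = idf_from_docs(docs)
    estimate = idf_from_docs(sample)
    return sum(abs(estimate[t] - full[t]) for t in estimate) / len(estimate)


# Toy collection (hypothetical data, for illustration only).
docs = [
    "term weighting in retrieval",
    "retrieval models and relevance",
    "query sampling for retrieval",
    "document length and term frequency",
    "collection statistics for term weighting",
    "axioms of retrieval",
]

# Estimation error at increasing sample sizes.
for size in (2, 4, 6):
    print(size, round(mean_abs_idf_error(docs, size), 3))
```

In a sketch of this kind, the sample size at which the error (or, more usefully, the retrieval effectiveness obtained with the estimated statistics) stops changing appreciably would indicate how much of the collection needs to be indexed.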
