A retrospective study of a hybrid document-context based retrieval model

This paper describes our novel retrieval model, which is based on the contexts of query terms in documents (i.e., document contexts). Our model is novel because it takes the document contexts into account explicitly, instead of using them implicitly to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques found in language models to solve the problem of zero probabilities. It combines the estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., the aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising: its mean average precision is 60-80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.
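The pipeline outlined above (a log-odds preference per document context, smoothing to avoid zero probabilities, then aggregation across contexts) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names, the bag-of-words treatment of a context, the choice of Jelinek-Mercer interpolation as the smoothing method, the weight `lam`, the pooled background model, and the three operators `max`, `mean`, and `sum`. The abstract does not say which smoothing method or which operators the model actually uses.

```python
import math
from collections import Counter


def background_model(rel_counts, nonrel_counts):
    """Pooled term distribution used as the smoothing background (assumption)."""
    pooled = rel_counts + nonrel_counts
    total = sum(pooled.values())
    return {term: count / total for term, count in pooled.items()}


def smoothed_prob(term, counts, total, background, lam):
    """Jelinek-Mercer interpolation of a sample estimate with the background."""
    sample_estimate = counts[term] / total if total else 0.0
    return (1.0 - lam) * sample_estimate + lam * background[term]


def context_log_odds(context, rel_counts, nonrel_counts, background, lam=0.5):
    """Log-odds preference of one context, summing per-term log ratios."""
    rel_total = sum(rel_counts.values())
    nonrel_total = sum(nonrel_counts.values())
    score = 0.0
    for term in context:
        if term not in background:
            continue  # assumption: terms unseen everywhere carry no evidence
        p_rel = smoothed_prob(term, rel_counts, rel_total, background, lam)
        p_non = smoothed_prob(term, nonrel_counts, nonrel_total, background, lam)
        score += math.log(p_rel / p_non)
    return score


def aggregate(scores, operator="max"):
    """Combine the context preferences of one document into a single score."""
    if not scores:
        raise ValueError("a document needs at least one context")
    if operator == "max":
        return max(scores)
    if operator == "mean":
        return sum(scores) / len(scores)
    if operator == "sum":
        return sum(scores)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical term counts taken from relevant and non-relevant contexts.
rel_counts = Counter({"retrieval": 3, "model": 1})
nonrel_counts = Counter({"retrieval": 1, "weather": 3})
background = background_model(rel_counts, nonrel_counts)

# Two hypothetical contexts of query terms found in one document.
contexts = [["retrieval", "model"], ["weather"]]
scores = [
    context_log_odds(c, rel_counts, nonrel_counts, background) for c in contexts
]
document_score = aggregate(scores, operator="max")
```

In this toy setup the first context scores positive and the second negative, so the choice of operator changes the document score: `max` lets one strongly preferred context decide, whereas `mean` and `sum` let the weak context pull the score down. Because the counts come from judged contexts, the sketch mirrors the retrospective setting, in which relevance information is fully available.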
