The D2Q2 Framework: On the Relationship and Combination of Language Modelling and TF-IDF

Language Modelling (LM) and TF-IDF are two retrieval models with different foundations. Earlier work has tried to establish how the two models relate and whether one subsumes the other. Whether combining them could yield a third, better model remains an open research question. This paper revisits the foundations of LM and TF-IDF and explores how the models’ bare structures relate and how these structures can be combined. We begin with the premise that TF-IDF is the P(d|q)/P(d) side of retrieval, which complements the common view that LM is the P(q|d)/P(q) side. Next, we develop a hybrid framework based on the decomposition of the product of the two sides, [P(d|q)/P(d)] · [P(q|d)/P(q)]. This leads to the D2Q2 family of models, which joins the inner components of LM and TF-IDF instead of combining their scores. The paper provides new insights into the relationship between LM and TF-IDF, and experimental results show that the D2Q2 models perform comparably to competitive baselines.
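
For orientation, the two sides named in the abstract can be written out with Bayes' theorem. This is a sketch rather than the paper's derivation: it assumes a single joint distribution P(d, q) over documents and queries, which the abstract does not state.

```latex
\frac{P(d \mid q)}{P(d)} = \frac{P(d, q)}{P(d)\,P(q)} = \frac{P(q \mid d)}{P(q)}
\qquad\Longrightarrow\qquad
\frac{P(d \mid q)}{P(d)} \cdot \frac{P(q \mid d)}{P(q)} = \frac{P(d, q)^{2}}{P(d)^{2}\,P(q)^{2}}
```

Under that assumption the two ratios are equal, so any difference between the LM side and the TF-IDF side would have to come from how each ratio is decomposed and estimated in practice.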
