Assessing multivariate Bernoulli models for information retrieval

Although the seminal proposal to introduce language modeling in information retrieval was based on a multivariate Bernoulli model, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and is believed to be less effective than the multinomial model. In this article, we compare the multivariate Bernoulli model with its successor and consider its role in future retrieval systems. In the context of Bayesian learning, the two modeling approaches are described, contrasted, and compared both theoretically and computationally. We show that a query likelihood that follows a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. We then address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before performing an empirical comparison to study the behavioral aspects of the models. A series of comparisons is conducted on a number of test collections and retrieval tasks to determine the empirical and practical differences between the models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. For the other tasks, however, the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the various retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose nonquery terms are informative. While this is detrimental to the task of document retrieval (documents tend to contain considerable nonquery content), it is valuable for other tasks such as sentence retrieval, where the retrieved elements are very short and focused.
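To make the contrast between the two query likelihoods concrete, the following is a minimal Python sketch, not the estimators used in the article. It assumes a single Dirichlet-style smoothed estimate of p(t|d) shared by both models, a toy vocabulary given by the keys of `coll_prob`, and query terms that all occur in that vocabulary; the function names and the parameter `mu` are hypothetical. The multinomial likelihood multiplies one factor per query term occurrence, whereas the multivariate Bernoulli likelihood multiplies one factor per vocabulary term, so nonquery terms also contribute through 1 - p(t|d).

```python
import math
from collections import Counter


def smoothed_term_prob(term, doc_counts, doc_len, coll_prob, mu=10.0):
    """Assumed Dirichlet-style estimate of p(term | doc); illustrative only."""
    return (doc_counts.get(term, 0) + mu * coll_prob[term]) / (doc_len + mu)


def multinomial_log_likelihood(query, doc, coll_prob, mu=10.0):
    """log P(q | d) as a product over query term occurrences (repeats count)."""
    counts = Counter(doc)
    return sum(
        math.log(smoothed_term_prob(t, counts, len(doc), coll_prob, mu))
        for t in query
    )


def bernoulli_log_likelihood(query, doc, coll_prob, mu=10.0):
    """log P(q | d) as a product over the whole vocabulary.

    Query terms contribute p(t | d); every other vocabulary term
    contributes 1 - p(t | d). Repeated query terms are ignored.
    """
    counts = Counter(doc)
    query_terms = set(query)
    score = 0.0
    for t in coll_prob:  # the vocabulary is assumed to be the keys of coll_prob
        p = smoothed_term_prob(t, counts, len(doc), coll_prob, mu)
        score += math.log(p) if t in query_terms else math.log(1.0 - p)
    return score


# Toy data (made up for illustration).
coll_prob = {"a": 0.5, "b": 0.3, "c": 0.2}
doc = ["a", "b", "a"]
print(multinomial_log_likelihood(["a", "a"], doc, coll_prob))
print(bernoulli_log_likelihood(["a", "a"], doc, coll_prob))
```

In this sketch, repeating a query term doubles its contribution to the multinomial score but leaves the Bernoulli score unchanged, and the Bernoulli score depends on every vocabulary term that is absent from the query.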
