Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model

Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naïve framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.

[1]  Sebastian Thrun,et al.  Learning to Classify Text from Labeled and Unlabeled Documents , 1998, AAAI/IAAI.

[2]  Rong Jin,et al.  Title language model for information retrieval , 2002, SIGIR '02.

[3]  David Heckerman,et al.  A Tutorial on Learning with Bayesian Networks , 1998, Learning in Graphical Models.

[4]  Norbert Fuhr,et al.  Probabilistic Models in Information Retrieval , 1992, Comput. J..

[5]  W. Bruce Croft,et al.  A language modeling approach to information retrieval , 1998, SIGIR '98.

[6]  Warren R. Greiff,et al.  A theory of term weighting based on exploratory data analysis , 1998, SIGIR '98.

[7]  Regina Barzilay,et al.  Towards Multidocument Summarization by Reformulation: Progress and Prospects , 1999, AAAI/IAAI.

[8]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[9]  Larry Fitzpatrick,et al.  Automatic feedback using past queries: social searching? , 1997, SIGIR '97.

[10]  Slava M. Katz Distribution of content words and phrases in text and language modelling , 1996, Natural Language Engineering.

[11]  Peter Willett,et al.  Using interdocument similarity information in document retrieval systems , 1997 .

[12]  David Madigan,et al.  On the Naive Bayes Model for Text Categorization , 2003, AISTATS.

[13]  John D. Lafferty,et al.  Two-stage language models for information retrieval , 2002, SIGIR '02.

[14]  Stephen E. Robertson,et al.  Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval , 1994, SIGIR '94.

[15]  Thomas Kalt,et al.  A New Probabilistic Model of Text Classification and Retrieval , 1998 .

[16]  Hans Peter Luhn,et al.  A Statistical Approach to Mechanized Encoding and Searching of Literary Information , 1957, IBM J. Res. Dev..

[17]  Kenney Ng A Maximum Likelihood Ratio Information Retrieval Model , 1999, TREC.

[18]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[19]  Alan Gous Adaptive Estimation of Distributions Using Exponential Sub-Families , 1998 .

[20]  K. Sparck Jones,et al.  A Probabilistic Model of Information Retrieval : Development and Status , 1998 .

[21]  Kenneth Ward Church,et al.  Poisson mixtures , 1995, Natural Language Engineering.

[22]  Kui-Lam Kwok,et al.  Improving two-stage ad-hoc retrieval for short queries , 1998, SIGIR '98.