The Smoothed Dirichlet distribution: understanding cross-entropy ranking in information retrieval

Unigram language modeling is a successful probabilistic framework for information retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature of this approach is its use of the empirically successful cross-entropy function between the query model and document models as a document ranking function. However, this function does not follow directly from the underlying models, and no justification for its use has been available to date. A related observation is that the naive Bayes model for text classification uses the same multinomial distribution to model documents but, in contrast, employs the document log-likelihood, which does follow directly from the model, as its scoring function. Curiously, the document log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR outperforms the document log-likelihood, yet this phenomenon remains largely unexplained.

One of the main objectives of this work is to develop a theoretical understanding of why the version of the cross-entropy function used for ranking in IR is so successful. We also aim to construct a likelihood-based generative model that corresponds directly to this cross-entropy function; such a model, if successful, would allow us to view IR essentially as a machine learning problem. A secondary objective is to bridge the gap between the generative approaches used in IR and in text classification through a unified model.

In this work we show that the cross-entropy ranking function corresponds to the log-likelihood of documents with respect to the approximate Smoothed Dirichlet (SD) distribution, a novel variant of the Dirichlet distribution. We also demonstrate empirically that this new distribution captures term occurrence patterns in documents much better than the multinomial, offering an explanation for the superior performance of the cross-entropy ranking function compared to the multinomial document likelihood. Our text classification experiments show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial-based naive Bayes model and on par with Support Vector Machines (SVMs), confirming our reasoning. In addition, this classifier is as quick to train as naive Bayes and several times faster than SVMs owing to its closed-form maximum-likelihood solution, making it ideal for many practical IR applications. We also construct a well-motivated generative classifier for IR based on the SD distribution that uses the EM algorithm to learn from pseudo-feedback, and show that its performance is equivalent to that of the Relevance Model (RM), a state-of-the-art model for IR in the language modeling framework that uses the same cross-entropy as its ranking function. Moreover, the SD-based classifier offers more flexibility than RM in modeling documents owing to its consistent generative framework; we demonstrate that this flexibility translates into superior performance on topic tracking, an online classification task.
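
To make the contrast concrete, the following is a minimal sketch, not code from the paper, of the two scoring functions discussed above: the cross-entropy ranking function used in language-modeling IR (shown here with a standard Dirichlet-smoothed document model) and the multinomial document log-likelihood used by naive Bayes. Function names, the smoothing parameter value, and the toy data are illustrative assumptions.

    import math
    from collections import Counter

    def dirichlet_smoothed_model(doc_tokens, collection_model, mu=2000.0):
        # P(w | theta_d) = (c(w, d) + mu * P(w | C)) / (|d| + mu)
        counts = Counter(doc_tokens)
        dlen = len(doc_tokens)
        return lambda w: (counts[w] + mu * collection_model.get(w, 1e-12)) / (dlen + mu)

    def cross_entropy_score(query_model, doc_model):
        # LM-IR ranking: sum over query terms of P(w | theta_q) * log P(w | theta_d).
        # Asymmetric: probabilities come from the query model, log-probabilities
        # from the (smoothed) document model.
        return sum(p_q * math.log(doc_model(w)) for w, p_q in query_model.items())

    def document_log_likelihood(doc_tokens, class_model):
        # Naive-Bayes scoring: sum over document terms of c(w, d) * log P(w | theta_c).
        # The opposite asymmetry: counts come from the document, while the model
        # being evaluated supplies the log-probabilities.
        counts = Counter(doc_tokens)
        return sum(c * math.log(class_model.get(w, 1e-12)) for w, c in counts.items())

    # Example: score one document against a two-term query.
    collection = {"retrieval": 0.01, "model": 0.02, "the": 0.05}
    doc = ["the", "retrieval", "model", "model"]
    query_model = {"retrieval": 0.5, "model": 0.5}   # maximum-likelihood query model
    doc_model = dirichlet_smoothed_model(doc, collection)
    print(cross_entropy_score(query_model, doc_model))

The two functions differ only in which distribution supplies the expectation and which supplies the log-probabilities; that asymmetry is exactly the gap between the IR ranking function and the document log-likelihood that the SD distribution is introduced to explain.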
