StatBM25: An Aggregative and Statistical Approach for Document Ranking

In information retrieval and Web search, BM25 is one of the most influential probabilistic retrieval formulas for document weighting and ranking. BM25 involves three parameters, $k_1$, $k_3$, and $b$, which provide scalar approximation and scaling of important document features such as term frequency, document frequency, and document length. In this paper we investigate aggregative and statistical document features for document ranking. Briefly, a statistically adjusted BM25 is used to score, in an aggregative way, virtual documents that are generated by randomly combining documents from the original collection. The problem size, measured in the number of virtual documents to be ranked, is thus an expansion of the size of the original problem. Consequently, ranking is realized through statistical sampling; at present we use rejection sampling, a simple Monte Carlo method. We call this new framework StatBM25, emphasizing first that the original problem domain is K-expanded (a concept explained further in the paper), and second that statistical sampling is employed in the model. Empirical studies on several standard test collections show that StatBM25 achieves a convincingly high degree of both uniqueness and effectiveness compared to BM25. We believe this suggests that StatBM25, as a statistically smoothed and normalized variant of BM25, may eventually lead to the discovery of useful new statistical measures for document ranking.
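The abstract does not give the exact statistically adjusted BM25 formula or the details of K-expansion, so the Python sketch below is only an illustration of the general pipeline under stated assumptions: virtual documents are formed by randomly combining K documents from the collection, scored with standard BM25, and accepted via rejection sampling under a heuristic envelope, with acceptance counts aggregated back to the member documents as a ranking signal. The function names (`make_virtual_doc`, `rejection_sample_ranking`), the envelope choice, and the credit-aggregation step are assumptions for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Standard BM25 score of a (possibly virtual) document for a query."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_norm
    return score

def make_virtual_doc(docs, k):
    """Randomly combine K documents from the collection into one virtual document."""
    chosen = random.sample(range(len(docs)), k)
    combined = Counter()
    for i in chosen:
        combined.update(docs[i])
    return chosen, combined

def rejection_sample_ranking(docs, query_terms, k=2, n_samples=5000, seed=0):
    """Score sampled virtual documents, accept each with probability proportional
    to its BM25 score (simple rejection sampling), and aggregate acceptance
    counts back to the original documents as a ranking signal (an assumption)."""
    random.seed(seed)
    n_docs = len(docs)
    avg_doc_len = sum(sum(d.values()) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(d.keys())  # document frequency: one count per containing document

    # Heuristic envelope: K times the best single-document score (not a tight bound).
    max_score = k * max(
        bm25_score(query_terms, d, sum(d.values()), avg_doc_len, df, n_docs)
        for d in docs
    ) or 1.0

    credit = Counter()
    for _ in range(n_samples):
        chosen, vdoc = make_virtual_doc(docs, k)
        vlen = sum(vdoc.values())
        s = bm25_score(query_terms, vdoc, vlen, avg_doc_len, df, n_docs)
        if random.random() < min(1.0, s / max_score):  # accept with prob ∝ score
            for i in chosen:
                credit[i] += 1  # aggregate credit to the member documents
    return credit.most_common()  # ranking of original documents by accumulated credit

if __name__ == "__main__":
    docs = [
        Counter("the quick brown fox jumps over the lazy dog".split()),
        Counter("information retrieval ranks documents by relevance".split()),
        Counter("bm25 is a probabilistic retrieval formula for ranking".split()),
        Counter("the dog sleeps".split()),
    ]
    print(rejection_sample_ranking(docs, ["retrieval", "ranking"], k=2))
```

In this sketch the sampling replaces exhaustive scoring of all K-combinations: the K-expanded space grows combinatorially, so only a budget of sampled virtual documents is ever scored.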
