A Proximity Probabilistic Model for Information Retrieval

We propose a proximity probabilistic model (PPM) that extends a bag-of-words probabilistic retrieval model. In the proposed model, a document is transformed into a pseudo document in which a term count is propagated to other nearby terms. We consider three heuristics (the distance between two query term occurrences, their order, and their term weights) and evaluate four kernel functions for measuring a position-dependent term count, which can be viewed as a pseudo term frequency. Finally, we integrate term proximity into the probabilistic model BM25 by replacing the term frequency with the pseudo term frequency. Experimental results on TREC data sets indicate that the proximity probabilistic model with the reverse kernel function consistently improves on the BM25 model by 5% to 11% in terms of Mean Average Precision.
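To make the pipeline above concrete, the following is a minimal sketch of how a kernel-based pseudo term frequency could be substituted for the raw term frequency in BM25. It is an illustration under stated assumptions, not the model's actual definition: the triangle kernel is a stand-in (the reverse kernel and the other three kernels are not specified here), the order and term-weight heuristics are omitted, and the function names, the `width` parameter, the propagation rule, and the smoothed IDF variant are all hypothetical choices made for this example.

```python
import math


def triangle_kernel(distance, width):
    """Stand-in kernel (assumption): weight falls linearly to 0 at `width`."""
    return max(0.0, 1.0 - distance / width)


def pseudo_tf(doc_terms, query_terms, term, width=5.0):
    """Hypothetical pseudo term frequency for `term` in one document.

    Each occurrence of `term` counts as 1, plus kernel-weighted counts
    propagated from nearby occurrences of the other query terms.
    """
    positions = [i for i, t in enumerate(doc_terms) if t == term]
    others = [i for i, t in enumerate(doc_terms)
              if t in query_terms and t != term]
    total = 0.0
    for p in positions:
        total += 1.0 + sum(triangle_kernel(abs(p - q), width) for q in others)
    return total


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """BM25 contribution of one term; `tf` may be a raw or a pseudo count.

    Uses a smoothed, always-positive IDF variant (an assumption).
    """
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Toy documents: the query terms are adjacent in one and far apart in the other.
query = {"proximity", "retrieval"}
near = ["proximity", "retrieval", "model"]
far = ["proximity"] + ["filler"] * 10 + ["retrieval"]

tf_near = pseudo_tf(near, query, "proximity")
tf_far = pseudo_tf(far, query, "proximity")
score_near = bm25_term_score(tf_near, len(near), 10.0, 1000, 50)
score_far = bm25_term_score(tf_far, len(far), 10.0, 1000, 50)
```

In this toy setup the document with adjacent query terms receives a pseudo term frequency above 1, while the document whose query terms lie farther apart than `width` falls back to the raw count of 1. Because the BM25 term score increases monotonically with the term frequency, a larger pseudo count yields a larger score for an otherwise identical document.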