Paper information - Probabilistic Document Length Priors for Language Models

This paper addresses the problem of devising a new document prior for the language modeling (LM) approach to Information Retrieval. The prior is based on term statistics, is derived in a probabilistic fashion, and offers a novel way of accounting for document length. Furthermore, we develop a new way of combining document length priors with the query likelihood estimate, based on the risk of accepting the latter as a score. The prior is combined with a document retrieval language model that uses Jelinek-Mercer (JM) smoothing, a technique that does not take document length into account. Adding the prior boosts retrieval performance, so that the combination outperforms an LM with a document-length-dependent smoothing component (Dirichlet prior) as well as another state-of-the-art, high-performing scoring function (BM25). The improvements are significant and robust across different collections and query sizes.
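As background for the setup the abstract describes, the sketch below shows a generic query likelihood scorer with Jelinek-Mercer smoothing and a document prior added in log space. It is a minimal illustration under stated assumptions, not the method of the paper: the abstract does not give the form of its prior or of its risk-based combination, so `length_log_prior` uses a simple placeholder proportional to document length. The function names, the value `lam=0.5`, and the toy collection are all hypothetical.

```python
import math
from collections import Counter


def jm_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Return log P(query | doc) under Jelinek-Mercer smoothing.

    lam is a fixed interpolation weight on the collection model; it does
    not depend on the length of the document.
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: likelihood is zero.
            return float("-inf")
        log_p += math.log(p)
    return log_p


def length_log_prior(doc, coll_len):
    """Placeholder prior: log P(doc) proportional to document length.

    ASSUMPTION: this is NOT the prior derived in the paper; it only
    stands in for some length-based document prior.
    """
    if not doc:
        return float("-inf")
    return math.log(len(doc) / coll_len)


def score(query, doc, coll_tf, coll_len, lam=0.5):
    """Rank-equivalent log posterior: log P(query | doc) + log P(doc)."""
    return jm_log_likelihood(query, doc, coll_tf, coll_len, lam) + length_log_prior(doc, coll_len)


# Toy collection (hypothetical): d2 is d1 repeated twice, so both
# documents have identical relative term frequencies.
docs = {
    "d1": ["language", "model", "retrieval"],
    "d2": ["language", "model", "retrieval"] * 2,
}
coll_tf = Counter(term for doc in docs.values() for term in doc)
coll_len = sum(len(doc) for doc in docs.values())
query = ["language", "model"]

for name, doc in docs.items():
    print(name, jm_log_likelihood(query, doc, coll_tf, coll_len), score(query, doc, coll_tf, coll_len))
```

In this toy collection the two documents receive the same JM likelihood because their relative term frequencies match, so only the prior separates them. Any other length-based prior could be substituted for `length_log_prior` without touching the rest of the sketch.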

Roi Blanco | Alvaro Barreiro
