BM25 Beyond Query-Document Similarity

The massive growth of information produced and shared online has made retrieving relevant documents a difficult task. Query Expansion (QE) based on term co-occurrence statistics has been widely applied to improve retrieval effectiveness. However, selecting good expansion terms from co-occurrence graphs remains challenging. In this paper, we present an adapted version of the BM25 model that measures the similarity between terms rather than between a query and a document. First, a context window-based approach is applied over the entire corpus to construct the term co-occurrence graph. Then, using the adapted BM25, candidate expansion terms are selected according to their similarity with the whole query. This measure stands out for its ability to evaluate the discriminative power of terms and to select terms that are semantically related to the query. Experiments on two ad-hoc TREC collections (the standard Robust04 collection and the new TREC Washington Post collection) show that our proposal outperforms the baselines over three state-of-the-art IR models and leads to significant improvements in retrieval effectiveness.
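The two steps described above can be illustrated with a minimal sketch. The abstract does not give the adapted formula, so everything below is an assumption made for illustration: each term's co-occurrence neighborhood is treated as a pseudo-document, the co-occurrence count plays the role of term frequency, and the standard BM25 parameters k1 and b are reused. The function names (build_cooccurrence, bm25_term_similarity, expand_query) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def build_cooccurrence(docs, window=2):
    """Count how often two distinct terms appear within `window` positions."""
    cooc = defaultdict(Counter)
    for tokens in docs:
        for i, term in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other != term:
                    # Keep the graph symmetric.
                    cooc[term][other] += 1
                    cooc[other][term] += 1
    return cooc


def bm25_term_similarity(cooc, candidate, query_terms, k1=1.2, b=0.75):
    """BM25-style score of `candidate` against the whole query.

    Assumption (not from the paper): the candidate's neighborhood is the
    "document", cooc[candidate][q] is the "term frequency" of q in it, and
    the number of neighbors of q is its "document frequency".
    """
    n_terms = len(cooc)
    if n_terms == 0:
        return 0.0
    avg_len = sum(sum(c.values()) for c in cooc.values()) / n_terms
    neighbors = cooc.get(candidate, Counter())
    cand_len = sum(neighbors.values())
    score = 0.0
    for q in query_terms:
        tf = neighbors[q]  # Counter returns 0 for unseen terms
        if tf == 0:
            continue
        df = len(cooc.get(q, ()))
        # Smoothed idf that stays positive: terms co-occurring with
        # everything are less discriminative.
        idf = math.log(1 + (n_terms - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * cand_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


def expand_query(cooc, query_terms, top_k=3):
    """Return the top_k neighbors of the query terms, ranked by score."""
    candidates = {t for q in query_terms for t in cooc.get(q, ())}
    candidates -= set(query_terms)
    ranked = sorted(
        candidates,
        key=lambda t: (-bm25_term_similarity(cooc, t, query_terms), t),
    )
    return ranked[:top_k]
```

Summing over all query terms is what scores a candidate against the whole query rather than against a single term. The length normalization penalizes candidates that co-occur with many terms, which is one plausible reading of "discriminative power" here, not necessarily the authors' definition.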
