Simple BM25 extension to multiple weighted fields

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g., title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
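The difference between the two schemes can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single query term, omits document length normalization, and uses hypothetical names (`saturate`, `score_linear_combination`, `score_weighted_tf`) and illustrative values for `k1`, the field weights, and the term counts.

```python
K1 = 1.2  # illustrative saturation parameter, an assumption of this sketch


def saturate(tf, k1=K1):
    """BM25-style term frequency saturation (length normalization omitted).

    The value grows with tf but never reaches k1 + 1.
    """
    return tf * (k1 + 1) / (tf + k1)


def score_linear_combination(field_tfs, weights, idf=1.0, k1=K1):
    """Score each field separately, then combine the field scores linearly."""
    return sum(weights[f] * idf * saturate(tf, k1) for f, tf in field_tfs.items())


def score_weighted_tf(field_tfs, weights, idf=1.0, k1=K1):
    """Weight and sum the term frequencies first, then saturate once."""
    combined_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturate(combined_tf, k1)


# Hypothetical document: the query term occurs once in the title, three times in the body.
weights = {"title": 2.0, "body": 1.0}
doc = {"title": 1, "body": 3}

linear = score_linear_combination(doc, weights)
weighted = score_weighted_tf(doc, weights)

# With a title weight of two, the weighted term frequency is 2*1 + 1*3 = 5,
# the same as an unstructured document in which the title is repeated twice.
flat = saturate(5)

print(f"linear combination of field scores: {linear:.3f}")
print(f"weighted term frequency, one saturation: {weighted:.3f}")
print(f"unstructured document with title repeated: {flat:.3f}")
```

In this sketch the linear combination exceeds the single-term ceiling of `k1 + 1`, since each field saturates on its own and the weights scale the result afterwards. The weighted term frequency score stays below that ceiling and matches the score of the flattened document.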
