Integrating Structure in the Probabilistic Model for Information Retrieval

In databases and on the World Wide Web, many documents are stored in a structured format (e.g. XML). In this article we propose extending the classical probabilistic IR model to take document structure into account by weighting tags. Our approach includes a learning step in which a weight is computed for each tag; this weight estimates the probability that the tag distinguishes the most relevant terms. The model was evaluated on a large collection during the INEX IR evaluation campaigns.
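
The abstract does not give the formula used in the learning step, so the following is only a minimal sketch of one plausible reading: each tag's weight is taken as the smoothed fraction of term occurrences under that tag that were judged relevant in training data. The function name `learn_tag_weights`, the `(tag, is_relevant)` input layout, and the `smoothing` parameter are all hypothetical and are not taken from the article.

```python
from collections import Counter


def learn_tag_weights(training, smoothing=0.5):
    """Estimate one weight per tag from labelled term occurrences.

    `training` is an iterable of (tag, is_relevant) pairs, one pair per
    term occurrence seen inside that tag (hypothetical input format).
    The weight is a smoothed estimate of P(relevant | tag); this estimator
    is an assumption, not the article's published formula.
    """
    relevant = Counter()
    total = Counter()
    for tag, is_relevant in training:
        total[tag] += 1
        if is_relevant:
            relevant[tag] += 1

    # Additive smoothing keeps weights strictly between 0 and 1,
    # even for tags observed only a few times.
    return {
        tag: (relevant[tag] + smoothing) / (total[tag] + 2 * smoothing)
        for tag in total
    }


if __name__ == "__main__":
    sample = [
        ("title", True),
        ("title", True),
        ("title", False),
        ("p", False),
    ]
    for tag, weight in sorted(learn_tag_weights(sample).items()):
        print(tag, weight)
```

In a sketch like this, a tag such as `title` that mostly surrounds relevant terms receives a weight above 0.5, and that weight could then scale the contribution of terms found under the tag in a probabilistic ranking function.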
