Examining topic shifts in content-oriented XML retrieval

Content-oriented XML retrieval systems support access to XML repositories by retrieving, in response to user queries, XML document components (XML elements) instead of whole documents. The retrieved XML elements should not only contain information relevant to the query, but also provide the right level of granularity. In INEX, the INitiative for the Evaluation of XML retrieval, a relevant element is defined to be at the right level of granularity if it is exhaustive and specific to the query. Specificity was introduced to capture how focused an element is on the query (i.e., that it does not discuss other, irrelevant topics). To score XML elements according to how exhaustive and specific they are given a query, the content and logical structure of XML documents have been widely used. One source of evidence that has led to promising results with respect to retrieval effectiveness is element length. This work examines a new source of evidence derived from the semantic decomposition of XML documents. We consider that XML documents can be semantically decomposed through the application of a topic segmentation algorithm. Using the semantic decomposition and the logical structure of XML documents, we propose a new source of evidence, the number of topic shifts in an element, to reflect its relevance and, more particularly, its specificity. This paper has three research objectives. First, we investigate the characteristics of XML elements reflected by their number of topic shifts. Second, we compare topic shifts to element length by incorporating each of them as a feature in a retrieval setting and examining their effects in estimating the relevance of XML elements given a query. Finally, we use the number of topic shifts as evidence for capturing specificity to provide focused access to XML repositories.
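To make the idea of "the number of topic shifts in an element" concrete, the following is a minimal sketch and not the paper's actual definition. It assumes that a topic segmentation algorithm has already produced a sorted list of token offsets at which new topical segments begin, and that each XML element is described by a (start, end) token span. The function name `count_topic_shifts`, the offsets, and the element paths are all hypothetical, and the choice to ignore boundaries that coincide with an element's own start or end is an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right


def count_topic_shifts(element_span, boundaries):
    """Count segment boundaries that fall strictly inside an element.

    element_span: (start, end) token offsets of the element (hypothetical layout).
    boundaries:   sorted token offsets where a new topical segment starts,
                  as produced by some topic segmentation algorithm.
    """
    start, end = element_span
    # Assumption: a boundary aligned with the element's own start or end
    # is not counted as a shift inside the element.
    lo = bisect_right(boundaries, start)  # first boundary strictly after start
    hi = bisect_left(boundaries, end)     # first boundary at or after end
    return max(0, hi - lo)


# Hypothetical segmentation of one document into four topical segments.
boundaries = [0, 120, 300, 450]

# Hypothetical element spans taken from the document's logical structure.
elements = {
    "article": (0, 600),
    "sec[1]": (0, 300),
    "sec[1]/p[2]": (130, 280),
}

shifts = {name: count_topic_shifts(span, boundaries) for name, span in elements.items()}
print(shifts)
```

Under these assumptions, an element that spans several topical segments receives a higher count than one contained in a single segment, which is the intuition behind using the count as evidence of how focused an element is. How the count is normalised or combined with other evidence, such as element length, is left open here.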
