Structured queries in XML retrieval

Document-centric XML is a mixture of text and structure. With the increased availability of document-centric XML content comes a need for query facilities in which both structural constraints and constraints on the content of the documents can be expressed. How does the expressiveness of languages for querying XML documents help users to express their information needs? We address this question from both an experimental and a theoretical point of view. Our experimental analysis compares a structure-ignorant retrieval approach with a structure-aware one, using the test suite of the 2004 edition of the INEX XML retrieval evaluation initiative. Theoretically, we create mathematical models of users' knowledge of a set of documents and define query languages which exactly fit these models. One of these languages corresponds to an XML version of fielded search, the other to the INEX query language. Our main findings are as follows. First, while structure is used in varying degrees of complexity, over half of the queries can be expressed in a fielded-search-like format which does not use the hierarchical structure of the documents. Second, when judged against the underlying information need, structure is used as a search hint and not as a strict requirement. Third, the use of structure in queries functions as a precision-enhancing device.
