INEX 2005 Workshop on Element Retrieval Methodology

Document-centric XML is a mixture of text and structure. With the increased availability of document-centric XML content comes a need for query facilities in which both structural constraints and constraints on the content of the documents can be expressed. This has generated considerable interest in both the IR and DB communities, and has led to the launch of evaluation efforts tailored for XML documents. One of the driving and long-standing research questions here is: How does the increased expressiveness of languages for querying XML documents help users express their information needs better and more effectively? And closely related to this: How should we evaluate systems that enable users to express their information needs using both content and structural constraints? In this paper we address these research questions. Our analysis follows two lines: What requirements can in principle be expressed in query languages for document-centric XML documents? And: How do users actually use such languages? For the former, we provide mathematical characterizations of two query languages, one for users with next to no knowledge of the document structure (ignorant users), and one for users who have some, but not complete, knowledge of the document structure (semi-ignorant users). To address the latter issue, we examine the topics formulated in the second query language as part of the 2004 edition of the INEX XML retrieval initiative. Our main findings are as follows: First, while structure is used in varying degrees of complexity, over half of the queries can be expressed in the very restrictive ignorant user language. Second, structure is used as a search hint, and not a search requirement, when judged against the underlying information need. Third, the use of structure in queries functions as a precision device. Fourth, the underlying retrieval task of content-and-structure querying is no different from the ordinary natural language query retrieval task. From these findings we derive a number of recommendations for the evaluation of systems that cater for content-and-structure queries.
