Understanding Content-and-Structure

Document-centric XML is a mixture of text and structure. With the increased availability of document-centric XML content comes a need for query facilities in which both structural constraints and constraints on the content of the documents can be expressed. This has generated considerable interest in both the IR and DB communities, and has led to the launch of evaluation efforts tailored for XML documents. One of the driving and long-standing research questions here is: How does the increased expressiveness of languages for querying XML documents help users to express their information needs better and more effectively? And closely related to this: How should we evaluate systems that enable users to express their information needs using both content and structural constraints? In this paper we address these research questions. Our analysis follows two lines: What requirements can in principle be expressed in query languages for document-centric XML documents? And: How do users actually use such languages? For the former, we provide mathematical characterizations of two query languages, one for users with next to no knowledge of the document structure (ignorant users), and one for users who have some, but not complete, knowledge of the document structure (semi-ignorant users). To address the latter issue, we examine the topics formulated in the second query language as part of the 2004 edition of the INEX XML retrieval initiative. Our main findings are as follows: First, while structure is used in varying degrees of complexity, over half of the queries can be expressed in the very restrictive ignorant user language. Second, structure is used as a search hint, and not a search requirement, when judged against the underlying information need. Third, the use of structure in queries functions as a precision device. Fourth, the underlying retrieval task of content-and-structure querying is no different from the ordinary natural language query retrieval task.
From those findings we derive a number of recommendations for the evaluation of systems that cater for content-and-structure queries.
