Processing content-oriented XPath queries

Document-centric XML collections contain text-rich documents, marked up with XML tags that add lightweight semantics to the text. Querying such collections calls for a hybrid query language: the text-rich nature of the documents suggests a content-oriented (IR) approach, while the mark-up allows users to add structural constraints to their IR queries. Hybrid queries tend to be more expressive, which should, in principle, lead to better retrieval performance. In practice, processing these hybrid queries within an IR system turns out to be far from trivial, because a delicate balance must be struck between structural and content information. We propose an approach to processing such hybrid content-and-structure queries that decomposes a query into multiple content-only queries, whose results are then combined in ways determined by the structural constraints of the original query. We evaluate our methods using the INEX 2003 test suite, and show (1) that finding effective ways of processing content-oriented XPath queries is non-trivial, (2) that effectiveness differs across topic types, but (3) that with appropriate processing methods retrieval effectiveness can improve.
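To make the decomposition idea concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a simplified NEXI-style syntax in which each location step carries one about() filter, assumes that element identifiers start with a document identifier, and assumes that scores are combined by multiplication. The names decompose and combine, and all scores, are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed query shape: //tag[about(., terms)] repeated once per step.
STEP = re.compile(r"//(\w+)\[about\(\.,\s*([^)]*)\)\]")


def decompose(query: str) -> List[Tuple[str, str]]:
    """Split a content-and-structure query into (element tag, content-only query) pairs."""
    return [(tag, terms.strip()) for tag, terms in STEP.findall(query)]


def combine(ancestor_scores: Dict[str, float],
            target_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank target elements, keeping only those whose document matched the ancestor query.

    Multiplying the two scores is an illustrative assumption, not the paper's formula.
    """
    ranked = []
    for path, score in target_scores.items():
        doc_id = path.split("/", 1)[0]  # assumed id layout: "doc/element-path"
        if doc_id in ancestor_scores:
            ranked.append((path, score * ancestor_scores[doc_id]))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


parts = decompose("//article[about(., xml retrieval)]//sec[about(., query languages)]")

# Hypothetical scores, as if returned by two content-only retrieval runs.
article_scores = {"d1": 0.5, "d2": 0.25}
sec_scores = {"d1/sec[2]": 0.5, "d2/sec[1]": 0.5, "d3/sec[1]": 0.75}
ranked = combine(article_scores, sec_scores)
```

Here the section from d3 is dropped because its article did not match the first content-only query; a less strict combination could instead keep it with a reduced score, which is the kind of balance between structure and content the abstract refers to.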
