Querying and ranking XML documents

XML represents both the content and the structure of documents. Taking advantage of document structure promises to greatly improve retrieval precision. In this article, we present a retrieval technique that adopts the similarity measure of the vector space model, incorporates document structure, and supports structured queries. Our query model is based on tree matching, a simple and elegant means of formulating queries without knowing the exact structure of the data. Using this query model, we propose a logical document concept in which document boundaries are decided at query time. We combine structured queries and term-based ranking by extending the term concept to structural terms, which include substructures of queries and documents. The notions of term frequency and inverse document frequency are adapted to logical documents and structural terms. We introduce an efficient technique for calculating all necessary term frequencies and inverse document frequencies at query time. By adjusting parameters of the retrieval process, we are able to model two contrary approaches: the classical vector space model and the original tree matching approach.
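
To make the combination of logical documents, structural terms, and vector space ranking more concrete, the following is a minimal sketch and not the method of the article itself. It rests on several assumptions that the abstract does not confirm: a structural term is taken to be a word qualified by the path of element labels leading to it, a logical document is taken to be any subtree whose root carries the label of the query root, and the weights use a plain tf-idf scheme with cosine similarity. The names `structural_terms` and `rank` and the sample catalog are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def structural_terms(elem, prefix=()):
    """Yield (label path, word) pairs for every word below elem.

    Assumption: a structural term is a word qualified by the path of
    element labels from the logical document root down to the word.
    """
    path = prefix + (elem.tag,)
    for word in (elem.text or "").lower().split():
        yield path, word
    for child in elem:
        yield from structural_terms(child, path)
        # Text following a child element still belongs to elem.
        for word in (child.tail or "").lower().split():
            yield path, word


def rank(xml_text, root_label, query_terms):
    """Rank logical documents rooted at root_label against query_terms."""
    tree = ET.fromstring(xml_text)
    # Document boundaries are decided at query time (assumption: every
    # subtree whose root label equals the query root label is a document).
    docs = list(tree.iter(root_label))
    tfs = [Counter(structural_terms(d)) for d in docs]

    # Document frequencies are computed over the logical documents only.
    df = Counter(term for tf in tfs for term in tf)
    idf = {term: math.log(len(docs) / df[term]) + 1.0 for term in df}

    q = Counter(query_terms)
    q_norm = math.sqrt(sum((q[t] * idf.get(t, 1.0)) ** 2 for t in q))
    scores = []
    for index, tf in enumerate(tfs):
        dot = sum(q[t] * idf[t] * tf[t] * idf[t] for t in q if t in tf)
        d_norm = math.sqrt(sum((tf[t] * idf[t]) ** 2 for t in tf))
        scores.append((dot / (d_norm * q_norm) if dot else 0.0, index))
    # Highest score first; ties keep document order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical sample data, for illustration only.
CATALOG = """
<catalog>
  <cd><title>piano concerto</title><composer>rachmaninov</composer></cd>
  <cd><title>rachmaninov tribute</title><composer>ashkenazy</composer></cd>
  <cd><title>violin sonata</title><composer>bach</composer></cd>
</catalog>
"""

query = [(("cd", "composer"), "rachmaninov"), (("cd", "title"), "piano")]
ranking = rank(CATALOG, "cd", query)
```

In this sketch, only the first `cd` element shares structural terms with the query, so it is ranked first. The second `cd` contains the word "rachmaninov" as well, but under `title` rather than `composer`, so it does not match the structural term and scores zero. A purely word-based ranking would not make that distinction.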
